Abstract

This  research  advances  feature  and  model  selection  for  feedforward  neural  networks.  Feature 
selection  involves  determining  a  good  feature  subset  given  a  set  of  candidate  features.  Model 
selection  involves  determining  an  appropriate  architecture  (number  of  middle  nodes)  for  the  neural 
network.  Specific  advances  are  made  in:  neural  network  feature  saliency  metrics  used  for  evaluating 
or ranking features, statistical identification of irrelevant/noisy features, and statistical investigation
of  reduced  neural  network  architectures  and  reduced  feature  subsets.  Additionally,  a  comprehensive 
statistically-based  methodology  is  presented  for  feature  and  model  selection. 

New feature saliency metrics are presented which provide a more succinct quantitative measure
of a feature’s importance than other similar metrics. A catalogue of feature saliency metric
definitions  and  interrelationships  is  developed  which  consolidates  the  set  of  available  metrics  for 
the  neural  network  practitioner. 

A  statistical  screening  procedure  for  identifying  noisy  features  is  presented.  The  procedure 
involves  statistically  comparing  the  saliency  of  candidate  features  with  the  saliency  of  a  known 
noisy feature. Noisy features are successfully identified over a series of test problems using the new
saliency  screening  procedure. 

Two  novel  neural  network  selection  algorithms  are  developed  by  posing  the  neural  network 
model  as  a  nonlinear  regression  statistical  model.  The  first  is  an  architecture  selection  algorithm 
and  the  second  is  a  feature  selection  algorithm.  The  feature  selection  algorithm  is  unique  because 
architecture reduction is investigated as features are removed. Both algorithms use the likelihood
ratio  test  statistic  within  a  backwards  sequential  procedure.  Application  results  demonstrate  how 
these algorithms can be used to search for a more parsimonious neural network model with
equivalent prediction accuracy.

A  comprehensive  neural  network  selection  methodology  is  developed  for  identifying  both 
a  good  feature  set  and  an  appropriate  neural  network  architecture  for  a  specific  situation.  It 
encompasses  a  combination  of  the  statistical  screening  and  the  statistical  architecture  and  feature 
selection  procedures.  Application  results  demonstrate  the  utility  of  the  methodology. 
Feature and Model Selection in Feedforward Neural Networks

I. Introduction

This  research  advances  feature  and  model  selection  for  feedforward  neural  networks.  Feature 
selection involves determining a good feature subset from a set of candidate features, and model
selection involves determining an appropriate neural network architecture (number of middle nodes).

Generally  speaking,  feedforward  neural  networks  are  used  as  either  regression  functions  or 
discriminant  fimctions.  For  both  linear  and  nonlinear  regression  analysis,  as  well  as  discriminant 
analysis,  there  is  statistical  theory  formalizing  the  process  of  feature  and  model  selection.  Until 
this  research,  statistical  theory  has  not  been  practically  used  to  formalize  the  selection  process  for 
feedforw2ird  neural  networks. 

These research results contribute to theory formalizing feature and model selection for
feedforward neural networks. Specific advances are made in: neural network feature saliency metrics
used for evaluating or ranking features, statistical identification of irrelevant/noisy features, and
statistical investigation of reduced neural network architectures and reduced feature subsets.
Additionally, a comprehensive statistically-based methodology is presented for feature and model
selection. The remainder of this chapter provides background on feature selection and feedforward
neural networks, and a preview of the dissertation.

1.1  Feature  Selection 

Feature  selection  as  considered  in  this  research  involves  determining  a  subset  of  candidate 
features  specifically  in  the  context  of  estimating  a  sufficiently  accurate  neural  network  prediction 
function.  Figure  1  shows  how  feature  selection  fits  into  an  overall  prediction  process.  In  this  section, 
the  term  ‘prediction  function’  is  general  and  is  also  used  to  refer  to  classification  functions. 
There are many reasons for using feature selection techniques to reduce the number of features,
including:

• satisfying the general goals of maximizing the accuracy of the prediction function while
minimizing the associated measurement costs

•  improving  prediction  accuracy  by  reducing  irrelevant  and  possibly  redundant  features 

•  reducing  the  complexity  and  the  associated  computational  costs  of  a  prediction  function 

• reducing the amount of data needed for accurate prediction (i.e., reducing the ‘curse of
dimensionality’ [14:487])

•  reducing  associated  data  collection  and  data  processing  cost 

•  improving  the  chances  that  a  solution  will  be  both  understandable  and  practical 

• improving the possibility of graphical representation of the data

Generally, a prediction function is finely tuned to the finite amount of available training data.
When “feature spaces” are plagued with irrelevant or redundant features, a prediction function
may not generalize well for predicting unknown data, particularly if there is insufficient training
data. A reduction in the number of features may degrade prediction accuracy on the training data,
but it also reduces the amount of data required for good generalization. Foley recommends that the
ratio of training vectors in a class to the dimensionality of the feature space be greater
than three to ensure that the error rate on held-out data is close to the true error rate [17:623].
His recommendation is based on empirical results for a two-class discrimination problem with
multivariate normal distributions for the input features.

Figure 1. Prediction Process for Regression and Discrimination Problems

In  this  research  a  formalized  feature  selection  process  is  characterized  by  three  components. 
The  first  component  is  a  metric  or  criterion  function  for  evaluating  and  ranking  the  features  (or 
feature  subsets).  The  second  component  is  a  set  of  screening  procedures  for  identifying  irrelevant 
and  redundant  features.  The  third  component  is  a  search  methodology  for  examining  possible 
feature  subsets.  The  results  of  this  research  provide  advances  to  all  three  components  of  the 
feature  selection  process  in  the  context  of  feedforward  neural  networks. 

1.2 Feedforward Neural Networks

Feedforward networks are generally used in two types of applications: regression analysis
and discriminant analysis. For regression applications, the network is used to estimate a linear or
nonlinear  function  for  prediction.  For  discriminant  analysis  applications,  the  network  is  used  to 
estimate  a  linear  or  nonlinear  discriminant  function  for  classification.  Covered  in  the  remainder  of 
this  section  are: 

•  an  overview  of  feedforward  neural  networks 

• the backpropagation algorithm

•  the  neural  network  approximation  to  the  Bayesian  optimal  discriminant  function 

•  confidence  interval  estimation  techniques 

1.2.1  Feedforward  Neural  Networks  Overview.  Feedforward  neural  networks,  often  referred 
to  as  multilayer  perceptrons,  generally  have  a  feature  input  layer,  one  or  more  hidden  layers,  and  a 
function  output  layer.  The  neural  network  shown  in  Figure  2  illustrates  the  structure  and  notation 
associated  with  a  single  hidden  layer  feedforward  neural  network.  The  notation  will  be  defined  in 
Section  1.2.2. 

Figure 2. Single Hidden Layer Feedforward Neural Network

The feature input layer consists of normalized feature input data. The feature input data
can be the raw data or an appropriate transformation (or projection) of the raw data. Typical
normalization of the feature inputs consists of either a simple transformation so that all features
have the same range, say between -1 and 1, or between 0 and 1, or a statistical normalization
which standardizes each feature to zero mean and unit variance [41:50] [60:16] [76:100].
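The two normalization schemes described above can be sketched as follows. This is a minimal illustration rather than the procedure used in this research; the function names `scale_to_range` and `standardize`, and the handling of constant features, are assumptions introduced here.

```python
import statistics


def scale_to_range(values, lo=-1.0, hi=1.0):
    """Linearly rescale one feature's values to the interval [lo, hi]."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # Assumption: a constant feature is mapped to the midpoint of the range.
        return [(lo + hi) / 2.0 for _ in values]
    span = v_max - v_min
    return [lo + (hi - lo) * (v - v_min) / span for v in values]


def standardize(values):
    """Standardize one feature to zero mean and unit sample variance."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (N - 1 divisor)
    if sd == 0:
        # Assumption: a constant feature is mapped to zero.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


feature = [2.0, 4.0, 6.0]
scaled = scale_to_range(feature)          # same range for every feature
standardized = standardize(feature)       # zero mean, unit variance
```

Each feature (column) would be normalized independently, so features measured on very different scales contribute comparably to the weighted sums at the middle nodes.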

Depending on the application, the nodes on the hidden (middle) layers either have linear or
nonlinear activation functions $f(a)$. For example, linear activation functions, where $f(a) = a$, could
be used in linear regression. The sigmoidal activation function is commonly used as a nonlinear
activation function since its derivative is continuous and makes the weight update rule simple for
backpropagation training. The sigmoidal activation function and its derivative are defined as:


$$f(a) = \frac{1}{1 + e^{-a}}, \qquad \frac{d f(a)}{d a} = f(a)\,\bigl(1 - f(a)\bigr)$$
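As a small illustration of why the sigmoid keeps the weight update rule simple, the sketch below evaluates the activation and its derivative, which needs only the activation value itself. The function names are illustrative, and the two-branch form is an implementation choice made here to avoid overflow for large negative arguments.

```python
import math


def sigmoid(a):
    """Sigmoidal activation f(a) = 1 / (1 + exp(-a))."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    # Algebraically identical form that avoids overflow when a is very negative.
    e = math.exp(a)
    return e / (1.0 + e)


def sigmoid_derivative(a):
    """Derivative df/da = f(a) * (1 - f(a))."""
    f = sigmoid(a)
    return f * (1.0 - f)


# The derivative is largest at a = 0 and shrinks toward 0 as |a| grows.
slope_at_zero = sigmoid_derivative(0.0)
slope_far_out = sigmoid_derivative(10.0)
```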


A single hidden layer network with sigmoidal squashing functions on the hidden layer is used
in this research. A single hidden layer configuration is common because Cybenko, and Hornik and
others, show that this type of network (with linear output nodes) is capable of arbitrarily accurate
approximations for any arbitrary function provided a sufficient number of hidden nodes are used
with either sigmoidal activation functions [12], or appropriately smooth activation functions [27].
Although the number of required hidden nodes is unknown in advance, a reasonable number of
middle nodes is often determined by a trial and error process or by more sophisticated methods
[9, 13, 25, 28, 31, 32, 37, 44, 52, 64, 68, 79, 83]. A by-product of the feature selection research
done in this dissertation is a novel architecture selection algorithm presented in Chapter V for
investigating the appropriate number of middle nodes. The algorithm is unique because it is based
on a nonlinear statistical model selection criterion.

The  function  output  layer  consists  of  one  or  more  nodes  with  linear  or  nonlinear  activation 
functions  depending  on  the  application.  Linear  output  activations  are  generally  used  for  function 
approximation.  Nonlinear  sigmoidal  output  activation  functions  are  generally  used  for  discriminant 
function applications. In this research, the output layer has sigmoid activation functions since
discrimination  function  applications  are  used. 

1.2.2  Backpropagation  Training  Algorithm.  Backpropagation  is  the  most  popular  algorithm 
for finding a feedforward neural network’s weight parameters. The backpropagation algorithm was
first developed by Werbos [77]. Later, it was rediscovered independently by Parker and then
reformulated by Rumelhart, Hinton and Williams with reference to prior work by Parker [48, 63].
A  good  overview  of  the  backpropagation  training  algorithm  is  in  Lippmann  [40,  41]. 

Backpropagation is an iterative gradient descent algorithm requiring sample problem data.
It involves minimizing the error between the actual and desired outputs of the network in order to
estimate a neural network’s optimal weight parameters. In the backpropagation algorithm described
herein, the weights are “instantaneously” updated after the presentation of each input vector. In
another version of backpropagation, batch backpropagation, the weights are only updated after the
error gradient has been aggregated for one full presentation or “epoch” of the training data. The
instantaneous backpropagation algorithm followed by additional details is presented next.

The Instantaneous Backpropagation Algorithm for a Single Hidden Layer Feedforward Neural Network

1. Randomly partition data into training, training-test, and validation sets.

2. Normalize the feature input data.

3. Initialize weights to small random values.

4. Present the network with a randomly selected vector from the training set, denoted $\mathbf{x}^p$.

5. Calculate the network output $\mathbf{z}^p$ associated with the pth training vector.

• kth neural network output: $z_k^p = f\left(\sum_{j=0}^{H} w_{jk}^2 x_j^1\right)$, where

- $H$ is the number of middle nodes

- $f(a) = 1/(1 + e^{-a})$ for sigmoidal activation functions

- $f(a) = a$ for linear activation functions

- $w_{jk}^2$ is the weight from middle node j to output node k

- $x_0^1$ is the middle layer bias term and is set equal to 1

- $x_j^1 = f\left(\sum_{i=0}^{M} w_{ij}^1 x_i^p\right)$ is the output of middle node j

- $M$ is the number of feature inputs

- $w_{ij}^1$ is the weight from input node i to middle node j

- $x_0^p$ is the input layer bias term, and is equal to 1

- $x_i^p$ is the ith feature input

6. Update the weights.

• upper layer weights: $(w_{jk}^2)^+ = (w_{jk}^2)^- + \eta\,\delta_k^2\,x_j^1$

• lower layer weights: $(w_{ij}^1)^+ = (w_{ij}^1)^- + \eta\,\delta_j^1\,x_i^p$, where

- $(w_{jk}^2)^+$ is the updated weight from middle node j to output k

- $(w_{jk}^2)^-$ is the old weight from middle node j to output k

- $(w_{ij}^1)^+$ is the updated weight from input i to middle node j

- $(w_{ij}^1)^-$ is the old weight from input i to middle node j

- $\eta$ is the step size

- $\delta_k^2 = (d_k^p - z_k^p)\,z_k^p\,(1 - z_k^p)$ if there is a sigmoid on the output

- $\delta_k^2 = (d_k^p - z_k^p)$ if the output is linear

- $\delta_j^1 = x_j^1 (1 - x_j^1) \sum_{k=1}^{K} \delta_k^2 (w_{jk}^2)^-$ if there is a sigmoid on middle node j

- $\delta_j^1 = \sum_{k=1}^{K} \delta_k^2 (w_{jk}^2)^-$ if middle node j is linear

- $d_k^p$ is the kth desired output of the pth exemplar

7. If training-test set error does not indicate sufficient convergence, go to step 4.
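Steps 3 through 6 of the algorithm can be sketched in plain Python as shown below. This is a hedged illustration, not the implementation used in this research: it assumes sigmoids on both the middle and output layers, stores the bias weights at index 0 of each weight table, and replaces the Step 7 convergence check with a fixed number of presentations. All function names, the step size, and the toy OR data are assumptions introduced here.

```python
import math
import random


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def init_weights(n_inputs, n_hidden, n_outputs, rng):
    """Step 3: small random weights in [-0.5, 0.5]; row 0 holds the bias weights."""
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_inputs + 1)]
    w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_outputs)] for _ in range(n_hidden + 1)]
    return w1, w2


def forward(x, w1, w2):
    """Step 5: return (inputs with bias, middle-node outputs with bias, outputs)."""
    xin = [1.0] + list(x)  # input layer bias term equals 1
    hid = [1.0] + [sigmoid(sum(w1[i][j] * xin[i] for i in range(len(xin))))
                   for j in range(len(w1[0]))]  # middle layer bias term equals 1
    out = [sigmoid(sum(w2[j][k] * hid[j] for j in range(len(hid))))
           for k in range(len(w2[0]))]
    return xin, hid, out


def update(x, d, w1, w2, eta):
    """Step 6: one instantaneous weight update for the exemplar (x, d)."""
    xin, hid, out = forward(x, w1, w2)
    delta2 = [(d[k] - out[k]) * out[k] * (1.0 - out[k]) for k in range(len(out))]
    # Lower-layer deltas are computed with the old upper-layer weights.
    # hid[j] for j >= 1 is a middle node; hid[0] is the bias and gets no delta.
    delta1 = [hid[j] * (1.0 - hid[j]) * sum(delta2[k] * w2[j][k] for k in range(len(out)))
              for j in range(1, len(hid))]
    for j in range(len(hid)):
        for k in range(len(out)):
            w2[j][k] += eta * delta2[k] * hid[j]
    for i in range(len(xin)):
        for j in range(len(delta1)):
            w1[i][j] += eta * delta1[j] * xin[i]


def average_squared_error(data, w1, w2):
    """Average over exemplars of the summed squared output error."""
    total = 0.0
    for x, d in data:
        z = forward(x, w1, w2)[2]
        total += sum((dk - zk) ** 2 for dk, zk in zip(d, z))
    return total / len(data)


def train(data, n_hidden, eta=0.5, n_presentations=5000, seed=0):
    rng = random.Random(seed)
    w1, w2 = init_weights(len(data[0][0]), n_hidden, len(data[0][1]), rng)
    for _ in range(n_presentations):
        x, d = rng.choice(data)    # Step 4: randomly selected training vector
        update(x, d, w1, w2, eta)  # Steps 5 and 6
    return w1, w2


# Toy usage: learn the logical OR of two inputs with two middle nodes.
or_data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
           ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
w1, w2 = train(or_data, n_hidden=2)
final_error = average_squared_error(or_data, w1, w2)
```

In a fuller version, the fixed presentation count would be replaced by periodic evaluation of the training-test set error, as Step 7 requires.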


In Step 1, the problem data is randomly divided into two sets: a training data set and a
validation data set [23:116-117]. The training data set is further subdivided into a training set and a
training-test set. The training set is used to estimate the weight parameters, and the training-test set
is used to evaluate the backpropagation learning process by measuring the network’s performance
on unknown data. The validation data set is a set of data which has not been used in any way to
determine the prediction function. The validation data set is used to evaluate a neural network’s
capability to adequately generalize to future data. Some guidelines on determining these data sets
are given in [23:116-119] [76:28-39].

Data  normalization  schemes  which  can  be  used  for  feature  input  normalization  in  Step  2  are 
discussed  in  Section  1.2.1.  In  this  research,  the  validation  data  set  is  normalized  separately  from 
the  training  data.  This  keeps  the  normalization  information  for  the  training  and  validation  data 
sets  separate. 

In Step 3, weight parameters are usually initialized to small random numbers between -0.5
and 0.5 [55:56]. Step 4 of the backpropagation algorithm is characterized by feeding a randomly
selected feature input vector $\mathbf{x}^p$ into the neural network, where p indicates that $\mathbf{x}$ is the pth vector
in the training set. Step 5 involves calculating the vector of network outputs in a feedforward
manner via the summations and sigmoids defined by the network’s structure.

In Step 6, the instantaneous network output error associated with $\mathbf{x}^p$ is calculated using
the pth vector of neural network outputs $\mathbf{z}^p$ and the corresponding vector of desired outputs $\mathbf{d}^p$.
Instantaneous network output error $E^p$ is the squared error associated with the pth exemplar and
is given as:

$$E^p = \sum_{k=1}^{K} \left(d_k^p - z_k^p\right)^2 \qquad (1)$$

where $K$ is the number of output nodes, $d_k^p$ is the desired output associated with the pth exemplar
and kth output, and $z_k^p$ is the network output with the pth exemplar and the kth output. The
gradient descent step direction is determined by taking the partial derivative of $E^p$ with respect to
the weight parameters. A derivation of the gradient descent step direction used in Step 6 is given
by Rogers and others [55].

The step size, $\eta$, can be constant or variable. White makes the point that a constant learning
rate is inefficient because the random influences in the input will result in random fluctuations in the
weight vector preventing backpropagation from ever settling down to the optimal weight vector [80].
A declining learning rate (eventually declining to zero) is minimally required for backpropagation
to settle down [80]. White suggests declining learning rates which are inversely proportional to the
number of epochs or the log of the number of epochs [80]. Three potential declining learning rates
are defined below in terms of: the total number of epochs $N_e$, the current epoch $L$, and the starting
value $a$ for a linearly declining learning rate. The drawback of the linearly declining learning rates
is that the maximum number of training epochs is used, which is not generally known in advance.


Log Declining Rate: $\eta_L = \dfrac{1}{\ln(1 + L)}$

Linearly Declining Rate: $\eta_L = a\left[1 - \dfrac{L}{N_e + 1}\right]$

Log-Linearly Declining Rate: $\eta_L = \dfrac{1 - \frac{L}{N_e + 1}}{\ln(1 + L)}$
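To illustrate how declining learning rates of this kind behave, the sketch below assumes the three schedules take the forms 1/ln(1 + L), a[1 - L/(N_e + 1)], and [1 - L/(N_e + 1)]/ln(1 + L), with epochs counted from L = 1. These exact expressions, the function names, and the example values are assumptions made for illustration.

```python
import math


def log_declining_rate(epoch):
    """Assumed form: 1 / ln(1 + L), for epoch L >= 1."""
    return 1.0 / math.log(1.0 + epoch)


def linearly_declining_rate(epoch, n_epochs, a):
    """Assumed form: a * (1 - L / (N_e + 1)); needs the total epoch count N_e."""
    return a * (1.0 - epoch / (n_epochs + 1.0))


def log_linearly_declining_rate(epoch, n_epochs):
    """Assumed form: (1 - L / (N_e + 1)) / ln(1 + L)."""
    return (1.0 - epoch / (n_epochs + 1.0)) / math.log(1.0 + epoch)


n_epochs = 100
schedule = [(L,
             log_declining_rate(L),
             linearly_declining_rate(L, n_epochs, a=0.5),
             log_linearly_declining_rate(L, n_epochs))
            for L in range(1, n_epochs + 1)]
```

Only the log declining rate can be computed without fixing the total number of epochs in advance, which is the drawback noted above for the linear variants.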


Two types of error rates are associated with backpropagation training: output error and
classification error. Output error is measured as a function of the approximation error between
the vector of network outputs $\mathbf{z}$ and the vector of desired or true outputs $\mathbf{d}$. Typically, the output
error is measured as the average squared network error. For a training set of $P$ vectors, the output
error, denoted $\mathcal{E}_o$, is defined

$$\mathcal{E}_o = \frac{1}{P} \sum_{p=1}^{P} \sum_{k=1}^{K} \left(d_k^p - z_k^p\right)^2 \qquad (2)$$

where $P$ is the number of exemplars in the training set, $K$ is the number of output nodes, $d_k^p$ is the
desired output associated with the pth exemplar and the kth output, and $z_k^p$ is the network output
with the pth exemplar and the kth output.

Classification error is used to measure the percentage of vectors which are incorrectly classified
in a data set. This type of error is applicable only when neural networks are used for discriminant
analysis or pattern classification problems. Define $I^p$, a Bernoulli random variable, as:

$$I^p = \begin{cases} 1 & \text{if } \mathbf{x}^p \text{ is incorrectly classified} \\ 0 & \text{otherwise} \end{cases}$$

where $\mathbf{x}^p$ is the pth exemplar. The classification error, denoted $\mathcal{E}_c$, is then defined as the average
of $P$ Bernoulli random variables $I^p$:

$$\mathcal{E}_c = P^{-1} \sum_{p=1}^{P} I^p$$
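The two error measures can be computed as in the sketch below. It assumes the output error is averaged over the P exemplars only, and that an exemplar is assigned to the class with the largest network output (with a 0.5 threshold when there is a single output); both the averaging convention and the decision rule are assumptions made for illustration.

```python
def output_error(desired, outputs):
    """Average squared network error over the P exemplars."""
    total = 0.0
    for d, z in zip(desired, outputs):
        total += sum((dk - zk) ** 2 for dk, zk in zip(d, z))
    return total / len(desired)


def is_misclassified(d, z):
    """Indicator I^p: True if the exemplar is incorrectly classified."""
    if len(z) == 1:
        # Assumed rule for a single output: threshold at 0.5.
        return (z[0] >= 0.5) != (d[0] >= 0.5)
    # Assumed rule for several outputs: the largest output names the class.
    return z.index(max(z)) != d.index(max(d))


def classification_error(desired, outputs):
    """Average of the P indicators, i.e. the fraction misclassified."""
    flags = [is_misclassified(d, z) for d, z in zip(desired, outputs)]
    return sum(flags) / len(flags)


desired = [[1, 0], [0, 1], [1, 0], [0, 1]]
outputs = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.3, 0.7]]
e_o = output_error(desired, outputs)          # average squared error
e_c = classification_error(desired, outputs)  # third exemplar is misclassified
```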


The random partitioning of the data in Step 1 is often referred to as the hold-out method when
discussed in the context of error estimation. The hold-out method gives a conservative estimate of
the average network error for two reasons. One, it does not use all available data while training the
classifier [14:355]. Two, it uses an independent validation data set, rather than the training-test
set, to measure the average network error.

Minimum output error (and minimum classification error if appropriate) on the training-test
set is a good indicator of sufficient convergence or good ‘generalization capability.’ In Step 7, the
algorithm terminates if the error rate has converged sufficiently; otherwise, it returns to
Step 4 to get a new training vector.

Over-training is an undesirable phenomenon which sometimes occurs with backpropagation
training. This phenomenon has occurred if the training-test set error begins to increase while the
training set error continues to decrease. Over-training indicates the training set has been memorized
at the expense of the network’s capability to predict (generalize to) the training-test set. In this
research, over-training was observed in cases of small training sets where a neural network with
more than enough middle nodes was used. The combination of a small training data set with a
large number of middle nodes gives the network the capability to memorize or to over-generalize
to the training set.
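One simple way to flag the over-training pattern described above is sketched here. The per-epoch error histories, the function names, and the `window` parameter are hypothetical; the check is only an illustration of the definition and is not the procedure used in this research.

```python
def best_stopping_epoch(test_errors):
    """Index of the epoch with the minimum training-test set error."""
    return min(range(len(test_errors)), key=lambda e: test_errors[e])


def is_over_training(train_errors, test_errors, window=3):
    """True if, over the last `window` epoch-to-epoch steps, the training set
    error kept decreasing while the training-test set error kept increasing."""
    if len(train_errors) <= window or len(test_errors) <= window:
        return False
    recent_train = train_errors[-(window + 1):]
    recent_test = test_errors[-(window + 1):]
    train_falling = all(b < a for a, b in zip(recent_train, recent_train[1:]))
    test_rising = all(b > a for a, b in zip(recent_test, recent_test[1:]))
    return train_falling and test_rising


# Hypothetical error histories recorded once per epoch.
train_hist = [0.50, 0.40, 0.30, 0.25, 0.20, 0.15]
test_hist = [0.55, 0.45, 0.40, 0.42, 0.45, 0.50]
stop_at = best_stopping_epoch(test_hist)
flagged = is_over_training(train_hist, test_hist)
```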

1.2.3 Approximation to the Bayesian Optimal Discriminant Function. In this section, the
neural network approximation to the Bayes optimal discriminant is discussed. The Bayes optimal
discriminant function minimizes the probability of error. It can be defined for the kth class of a
multi-class problem using the posterior probability of $\mathbf{x}$ belonging to class k [61]. The vector $\mathbf{x}$ is
classified as belonging to class k if the largest discriminant function value is from the kth class.

Several researchers have proven that a neural network approximates a Bayes optimal
discriminant under certain conditions [30, 54, 61, 68, 75] [41:50]. These conditions are:

•  The  neural  network  is  trained  to  outputs  of  0  and  1. 

•  The  training  set  data  are  random  variables. 

•  The  network  is  trained  to  a  minimum  mean  square  error  measure. 
• The training set class membership percentages reflect the real world.

When these conditions are met the neural network outputs $z_k$ can be interpreted as approximations
to  the  posterior  probability  for  class  k.  The  quality  of  these  approximations  is  affected  by: 

•  Neural  network  complexity  (i.e.  number  of  middle  nodes) 

• Amount of training data

•  Convergence  of  the  neural  network  to  a  solution 

1.2.4  Confidence  Interval  Estimation.  Confidence  intervals  can  be  used  to  assess  neural 
network  error  rates.  The  performance  over  several  runs  of  a  neural  network  can  be  characterized 
as  an  average  error  rate.  A  confidence  interval  for  the  average  error  provides  information  as  to 
the point statistic’s variability. Generally, more observations on a random variable will reduce the
corresponding  confidence  interval  length. 

The error rate of a neural network can be considered an independent random variable
dependent on: the random order of the training data set, the random starting point used to determine
the weight parameters, and the selected termination point of the neural network training. For a
reasonably large number of training vectors $P$, the standard normal distribution or the t
distribution can be used for confidence interval estimation, depending on whether the variance is known
or must be estimated. These distributions are appropriate for large $P$, since the error is
independent and approximately normally distributed by the central limit theorem [47:6]. For small
data sets, confidence intervals for proportions may also be appropriate for classification error rates
[24:252-254].

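Let $\mathcal{E}_i$, $i = 1, \ldots, N$, be a random sample of $N$ observations of the neural network error rate
$\mathcal{E}$, which are assumed to be normally distributed. The mean and variance of $\mathcal{E}$ are defined as:

$$E\{\mathcal{E}\} = \mu$$

$$\mathrm{var}\{\mathcal{E}\} = \sigma^2$$

where $E\{\cdot\}$ is the expectation operator and $\mathrm{var}\{\cdot\}$ is the variance operator. The corresponding
unbiased and consistent estimators for $E\{\mathcal{E}\}$ and $\mathrm{var}\{\mathcal{E}\}$ are $\bar{\mathcal{E}}$ and $s^2$, respectively, which are
defined

$$\bar{\mathcal{E}} = \frac{1}{N} \sum_{i=1}^{N} \mathcal{E}_i$$

$$s^2 = \frac{1}{N - 1} \sum_{i=1}^{N} \left(\mathcal{E}_i - \bar{\mathcal{E}}\right)^2$$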


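The mean error $\bar{\mathcal{E}}$ is also approximately normally distributed with an expected value of $\mu$ and
variance of $\sigma^2 / N$ [47:7]. That is:

$$E\{\bar{\mathcal{E}}\} = \mu$$

$$\mathrm{var}\{\bar{\mathcal{E}}\} = \frac{\sigma^2}{N}$$

The corresponding unbiased and consistent estimators for $E\{\bar{\mathcal{E}}\}$ and $\mathrm{var}\{\bar{\mathcal{E}}\}$ are $\bar{\mathcal{E}}$ and $s^2 / N$,
respectively.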

When $\sigma^2$ is known, the standard normal distribution can be used to form confidence intervals
for the expected value of $\bar{\mathcal{E}}$. In practice, however, the true variance is unknown and must be
estimated with $s^2$. Therefore, the t distribution is most appropriate for forming confidence intervals
for $E\{\bar{\mathcal{E}}\}$ with the statistic

$$\frac{\bar{\mathcal{E}} - \mu}{s / \sqrt{N}}$$

which is distributed as a t distribution with $N - 1$ degrees of freedom. Confidence intervals for
$E\{\bar{\mathcal{E}}\}$ using the t distribution look like:

$$\bar{\mathcal{E}} \pm t_{1-\alpha/2;\,N-1}\,\frac{s}{\sqrt{N}}$$

where $t_{1-\alpha/2;\,N-1}$ is determined by the t distribution for a given confidence coefficient $1 - \alpha$ and
degrees of freedom $N - 1$. One can be $100(1 - \alpha)$ percent confident that the absolute error in
estimating $E\{\bar{\mathcal{E}}\}$ is less than the confidence interval half width of $t_{1-\alpha/2;\,N-1}\, s / \sqrt{N}$.
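The interval computation can be sketched as follows. To keep the example free of third-party libraries, the t critical value is passed in by the caller (for instance, read from a t table); the function name, the argument names, and the five example error rates are hypothetical.

```python
import math
import statistics


def t_confidence_interval(error_rates, t_crit):
    """Confidence interval for the expected error rate.

    error_rates: N observed error rates, one per trained network.
    t_crit: t critical value for the chosen confidence coefficient and
            N - 1 degrees of freedom, supplied by the caller.
    """
    n = len(error_rates)
    mean = statistics.mean(error_rates)
    s = statistics.stdev(error_rates)        # uses the N - 1 divisor
    half_width = t_crit * s / math.sqrt(n)
    return mean - half_width, mean + half_width


# Hypothetical error rates from N = 5 runs; 2.776 is the tabulated
# two-sided 95 percent t value for 4 degrees of freedom.
rates = [0.10, 0.12, 0.08, 0.11, 0.09]
lower, upper = t_confidence_interval(rates, t_crit=2.776)
```

Collecting more runs shrinks the half width both through the square root of N and through the smaller critical value at higher degrees of freedom.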

When computing confidence intervals for neural networks, one factor to consider is that
backpropagation learning may or may not converge to a local minimum [80:143]. Usually, one does
not want to corrupt neural network point statistics and confidence intervals by including a network
which has not converged. According to White, it makes sense to train a number of neural networks
and select the network which minimizes network error [80:143]. Although this type of methodology
does not guarantee being close to a global minimum, it usually yields estimated network parameters
which are consistent for a local minimum [80:143].

In  this  research,  the  inconsistency  of  backpropagation  training  is  taken  into  consideration. 
An  attempt  is  made  to  only  use  neural  network  results  which  correspond  to  networks  which  are 
trained to good local minima. For a network to be considered ‘trained,’ the network is required
to attain a predetermined (for the problem at hand) maximum error rate on the training set.
When a reasonable error rate is not attained, it is assumed that the network has not converged
to  a  good  local  minimum,  and  the  results  from  this  network  are  not  used.  In  some  cases,  it  is 
impractical  to  enforce  a  reasonable  error  rate.  In  these  cases,  only  a  subset  of  the  neural  networks 
are  used  to  compute  the  statistics  for  network  error.  Each  network  in  the  subset  represents  the 
best  network  from  a  sub-experiment  where  only  the  best  network  is  kept  from  a  number  of  trained 
neural  networks. 

1.3 Preview

The remainder of this dissertation is organized as follows: Chapter II provides background
on the feature selection techniques associated with regression, discriminant analysis, and neural
networks. In Chapter III, novel feature metrics for evaluating and ranking candidate features are
defined and evaluated, and theoretical relationships among the set of available feature metrics are
documented. A technique for statistically identifying noisy features is presented in Chapter IV.
In Chapter V, selection algorithms are developed using a nonlinear regression statistical model
building perspective for both architecture determination and feature input selection in neural
networks. Then, in Chapter VI, a comprehensive neural network selection methodology is developed
for identifying both a good feature set and an appropriate neural network architecture for a
specific situation. The research is summarized and future research recommendations are made in
Chapter VII.
II. Background on Feature Selection Techniques


2.1  Introduction 

This chapter provides a complete review of feature selection techniques developed for
regression, discriminant analysis, and neural networks. The feature selection techniques developed for
regression and discriminant analysis, as well as those developed for neural networks, are relevant
background material since neural network applications include problems which have often been
solved using more classical regression and discriminant analysis techniques. Feature selection
techniques formally developed for classical regression and discriminant analysis may also be potentially
useful in a neural network context.

In  Section  2,  some  common  feature  (variable)  selection  criteria  and  methodologies  for  linear 
regression  are  covered.  Statistical  variable  selection  criteria  for  univariate  nonlinear  regression 
are  covered  in  Section  3.  In  Section  4,  feature  evaluation  criteria  and  selection  methodologies  are 
reviewed  for  discriminant  analysis.  The  feature  evaluation  metrics  and  selection  methods  developed 
specifically  for  neural  networks  are  discussed  in  Section  5.  In  Section  6,  the  chapter  is  summarized. 

2.2  Linear  Regression 

There are several good references which survey aspects of the variable (feature) selection
problem for linear regression [26, 43, 47, 49]. In the linear regression literature features are
generally referred to as predictor variables or just variables. In this section, univariate response linear
regression is reviewed including standard variable evaluation criteria and selection procedures. An
extension to the multivariate response case can be found in Chapter 8 of Anderson [3] and Chapter
15 of Krzanowski [35].

2.2.1 Univariate Linear Regression Overview. Univariate response linear regression is a
technique which describes the statistical relationship between a response variable and a set of
fixed predictor variables. A statistical relationship, in contrast to a functional relationship, is not
exact. Statistical relationships are characterized by the tendency for the response variable to change
systematically with the set of fixed predictor (regressor) variables, and the tendency for points to
scatter around the curve of statistical relationship. Neter, Wasserman, and Kutner [47:27] describe
how these characteristics are embodied in a regression model by postulating:

•  There  is  a  probability  distribution  of  the  response  variable  for  each  level  of 
the  predictor  variables. 

•  The  means  of  these  probability  distributions  of  the  response  variable  vary  in 
some  systematic  fashion  with  the  predictor  variables. 

A discussion of linear regression assuming randomly distributed predictor variables with a multivariate
normal distribution can be found in both Thompson [73] and Anderson [3].

Consider  the  linear  regression  model 

$$y = X\beta + \epsilon, \qquad \epsilon \sim N_n(0, \sigma^2 I_n) \tag{3}$$

where y is an n-dimensional vector of responses, and X is an n by k matrix with k - 1 columns
of fixed independent predictor variables and a column of 1's for the constant or bias term. Also,
$\beta$ is a k-dimensional vector of unknown variable coefficients. The predicted or fitted vector of y,
denoted $\hat{y}$, is defined

$$\hat{y} = Xb, \tag{4}$$

where b is a k-dimensional vector of estimated parameters for $\beta$.

The  method  of  least  squares  is  used  more  extensively  than  any  other  estimation  procedure  for 
determining  regression  model  coefficients  [46:12].  This  method  requires  minimization  of  the  sum  of 
squared  errors,  SSE,  given  as 

$$SSE = (y - \hat{y})'(y - \hat{y})$$

Substituting for $\hat{y}$ using Equation 4 gives

$$SSE = (y - Xb)'(y - Xb)$$

To  find  the  least  squares  estimator,  partial  derivatives  of  SSE  are  taken  with  respect  to  b  and  set 
equal  to  0,  defining  a  set  of  normal  equations.  In  matrix  terms,  these  normal  equations  are: 

$$(X'X)b = X'y, \tag{5}$$

where X'X is a k × k matrix and X'y is a k-dimensional vector. Now, assuming $(X'X)^{-1}$
exists, Equation 5 is solved for b using matrix algebra giving the least squares estimate:

$$b = (X'X)^{-1}X'y \tag{6}$$

Regardless of the distributional form of the errors $\epsilon$, provided they are uncorrelated with zero mean
and constant variance, the method of least squares provides unbiased point
estimators with minimum variance among all unbiased linear estimators [47:52]. However, for
confidence intervals and most statistical hypothesis tests, it is necessary to assume that the errors
are independently and normally distributed as given in Equation 3.

In neural network terms, the vector y is the n × 1 vector of observations of desired outputs
whereas the vector $\hat{y}$ is the n × 1 vector of actual outputs or trained neural network outputs. The
matrix X is the data matrix of measured feature vectors, including the bias term which is equal
to one for each feature vector. The vector $\beta$ is somewhat analogous to the vector of unknown
optimal neural network weight parameters, and the vector of estimated coefficients b is somewhat
analogous to the vector of trained or estimated neural network weight parameters. Also, SSE is
the squared error function which is minimized in the standard backpropagation algorithm.
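
The least squares computation of Equations 4 through 6 can be sketched numerically as follows. This is a minimal illustration, not part of the original development: the data are synthetic, and the names `least_squares`, `beta_true`, and the sample size are assumptions chosen only for the example.

```python
import numpy as np

def least_squares(X, y):
    """Solve the normal equations (X'X) b = X'y for b (Equation 5)."""
    XtX = X.T @ X
    Xty = X.T @ y
    # Solving the linear system avoids explicitly forming (X'X)^{-1}.
    return np.linalg.solve(XtX, Xty)

# Hypothetical data: n = 50 observations, k = 3 columns (bias + 2 predictors).
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -0.5])        # stands in for the unknown beta
y = X @ beta_true + 0.1 * rng.normal(size=n)  # responses with normal errors

b = least_squares(X, y)                  # least squares estimate (Equation 6)
y_hat = X @ b                            # fitted values (Equation 4)
sse = float((y - y_hat) @ (y - y_hat))   # sum of squared errors
print(b, sse)
```

At the solution the residual vector is orthogonal to every column of X, which is exactly what the normal equations state.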

2.2.2 Selection Criteria. In this section, six of the common variable selection criteria used
with linear regression are reviewed. The first two, $R_p^2$ and $R_a^2$, are measures of the proportionate
reduction in the total variation of y associated with a subset of p predictor variables. These metrics
can be used to evaluate the quality of a regression model's fit to the present data. The third and
fourth criteria, $C_p$ and $PRESS_p$, are measures of a regression model's predictive error for a subset of
p predictor variables. These criteria are computed using a validation data set which is independent
of the data set used to estimate the regression parameters. $C_p$ and $PRESS_p$ are used to assess the
quality of future prediction for a subset of candidate variables. The last two selection criteria are
the Akaike and Schwarz information criteria. These criteria are based on maximizing the theoretic
information content of a variable subset.

The coefficient of multiple determination, $R_p^2$, measures the proportionate reduction of total
variation in y associated with a particular set of p predictor variables. The total sum of squares
SSTO, which is constant regardless of which predictors are used, the regression sum of squares for
p predictor variables $SSR_p$, and the sum of squared errors for p predictor variables $SSE_p$ are all
needed to define $R_p^2$.

$$SSTO = (y - \bar{y})'(y - \bar{y})$$

$$SSR_p = (\hat{y} - \bar{y})'(\hat{y} - \bar{y}) \tag{7}$$

$$SSE_p = (y - \hat{y})'(y - \hat{y}) \tag{8}$$

Here $\bar{y}$ denotes the n-dimensional vector whose elements all equal the sample mean of y.
Now, the coefficient of multiple determination can be defined as

$$R_p^2 = \frac{SSR_p}{SSTO} = 1 - \frac{SSE_p}{SSTO}$$

where the subscript p indicates that only p predictor variables have been used from a 'superset' of
q candidate predictors, where p < q.

For classical least squares linear regression, presented in Section 2.2.1, $SSTO = SSR_p + SSE_p$.
The $R_p^2$ criterion does not correct for the number of variables in the model; therefore, it is maximized
when all q predictor variables are used. The $R_p^2$ criterion is useful for comparing several subsets of
equal size. The subset with the largest value of $R_p^2$ is the subset which is associated with reducing
the largest proportion of the total variation in y. When subsets are not of equal size, $R_p^2$ can be
subjectively inspected to identify variables which do not substantially increase $R_p^2$.

The $R_a^2$ criterion is similar to the $R_p^2$ criterion, except that it is adjusted for the number of
variables in the regression model. It is defined as

$$R_a^2 = 1 - \left(\frac{n-1}{n-p}\right)\frac{SSE_p}{SSTO}$$

Using the fact that mean squared error for p predictor variables $MSE_p$ is defined as

$$MSE_p = \frac{SSE_p}{n-p}, \tag{9}$$

$R_a^2$ can be written as

$$R_a^2 = 1 - \frac{MSE_p}{SSTO/(n-1)}$$

Maximizing $R_a^2$ is equivalent to minimizing the mean squared error for p variables $MSE_p$; therefore,
a good feature subset will be associated with a large value of $R_a^2$. The relationship between $R_a^2$ and
$R_p^2$ is

$$R_a^2 = R_p^2 - \left(\frac{p-1}{n-p}\right)\frac{SSE_p}{SSTO}$$
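
As a hedged numerical illustration of the two criteria, the sketch below computes $R_p^2$ and $R_a^2$ for a reduced and a full model on synthetic data. It assumes that p counts the columns of the subset design matrix including the bias column, consistent with the divisor n - p in Equation 9; the function name `r2_criteria` and the data are hypothetical.

```python
import numpy as np

def r2_criteria(X_p, y):
    """Return (R2_p, R2_a) for a subset design matrix X_p whose first column is the bias."""
    n, p = X_p.shape                       # assumption: p includes the bias column
    b, *_ = np.linalg.lstsq(X_p, y, rcond=None)
    resid = y - X_p @ b
    sse_p = float(resid @ resid)
    ssto = float(((y - y.mean()) ** 2).sum())
    r2_p = 1.0 - sse_p / ssto
    r2_a = 1.0 - (n - 1) / (n - p) * sse_p / ssto
    return r2_p, r2_a

# Hypothetical data: y depends on x1 only; x2 is pure noise.
rng = np.random.default_rng(0)
n = 40
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 0.5 * rng.normal(size=n)

X_sub = np.column_stack([np.ones(n), x1])
X_full = np.column_stack([np.ones(n), x1, x2])
r2_sub, ra_sub = r2_criteria(X_sub, y)
r2_full, ra_full = r2_criteria(X_full, y)
print(r2_sub, ra_sub, r2_full, ra_full)
```

Adding the noise variable can never lower $R_p^2$, whereas $R_a^2$ may fall because of the penalty on the number of parameters.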


Mallows suggests a criterion based on minimizing the mean squared error of prediction MSEP
[42]. The definition of MSEP for p predictor variables is the same as $MSE_p$ in Equation 9, except
that now the variables y, $\hat{y}$, and n correspond to an independent validation data set (different from
the data set used to estimate the regression parameters).

A standardized MSEP criterion, $\Gamma_p$, is defined as the ratio of the reduced model's MSEP over
the full model's true error variance, $\sigma^2$. This criterion is meant to find a reduced model with p
predictor variables that provides a similar MSEP to the full model. Here, the full model is assumed
to be unbiased, so the full model's MSEP is an unbiased estimator of $\sigma^2$.


Mallows' $C_p$ criterion is an estimator of $\Gamma_p$ and is defined as

$$C_p = \frac{SSEP_p}{MSEP} - (n - 2p)$$

where $SSEP_p$ is the sum of squared errors for p predictor variables on the validation data set, and
MSEP is the mean square prediction error for the full model on the validation data set. If there is
no bias in the reduced model of p predictor variables, then $SSEP_p \approx SSEP$ for the full model, and
the expected value of $C_p$ is approximately p. For a good feature subset, the $C_p$ criterion should be
small and as close to p as possible, indicating a small prediction bias associated with the reduced
regression model of p predictors.

The $PRESS_p$ selection criterion proposed by Allen is based on minimizing the sum of squared
deleted residuals [2]. Let d be the n-dimensional vector of deleted residuals, where the deleted
residual $d_i$ is the prediction error for observation i when a regression model is fit without the ith
observation. The $PRESS_p$ statistic for a model with p predictor variables is formed by the sum of
the squared deleted residuals for that model.

$$PRESS_p = d'd$$

An equivalent expression for $d_i$ can be used which makes it unnecessary to re-estimate the
regression model for each $d_i$ [47:451]. The equivalent expression is

$$d_i = \frac{e_i}{1 - h_{ii}}$$

where $e_i$ is the ordinary residual with no observations deleted, and $h_{ii}$ is the ith element along
the diagonal of the matrix $H = X(X'X)^{-1}X'$. Models with the lowest $PRESS_p$ for a subset of
size p fit well in a predictive sense. This criterion is subjective when used to discriminate between
variable subsets of different sizes p.
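
The equivalence between the deleted residuals and the hat-matrix shortcut can be checked numerically. The sketch below is illustrative only: it computes PRESS both ways on synthetic data, and the function names `press_hat` and `press_refit` are assumptions introduced for the example.

```python
import numpy as np

def press_hat(X, y):
    """PRESS from ordinary residuals and the hat-matrix diagonal: d_i = e_i / (1 - h_ii)."""
    H = X @ np.linalg.solve(X.T @ X, X.T)   # H = X (X'X)^{-1} X'
    e = y - H @ y                            # ordinary residuals
    d = e / (1.0 - np.diag(H))               # deleted residuals
    return float(d @ d)

def press_refit(X, y):
    """PRESS by refitting the model n times, leaving one observation out each time."""
    n = len(y)
    d = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        d[i] = y[i] - X[i] @ b               # prediction error for the held-out point
    return float(d @ d)

# Hypothetical data: bias column plus two predictors.
rng = np.random.default_rng(3)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([0.5, 1.0, -2.0]) + 0.2 * rng.normal(size=n)

press_fast = press_hat(X, y)
press_slow = press_refit(X, y)
b_all, *_ = np.linalg.lstsq(X, y, rcond=None)
sse_ord = float((y - X @ b_all) @ (y - X @ b_all))
print(press_fast, press_slow, sse_ord)
```

The two PRESS values agree to rounding error, and PRESS is never smaller than the ordinary SSE because each $h_{ii}$ lies between 0 and 1.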

Akaike  proposed  a  criterion  developed  for  statistical  model  identification  based  on  information 
theoretic considerations [1]. The criterion can be defined as

$$AIC_p = \log[f(y; X, b)] - p$$

where p is the number of unknown parameters, and $f(y; X, b)$ is the probability density function
of y evaluated at b (maximum likelihood estimates of the p unknown parameters of the subset
model). A good feature subset is identified by a maximum AIC.

Schwarz proposes a Bayesian version of AIC which is also maximized when used for feature
selection [71]. It is defined as

$$\log[f(y; X, b)] - \tfrac{1}{2}\, p \log n$$

where $f(y; X, b)$ is the probability density function of y evaluated at b (maximum likelihood estimates
of the p unknown parameters of the subset model), and n is the number of samples. This
criterion is better suited than Akaike's to selecting lower-dimensional models. When n is large, the
two procedures may produce results which differ greatly [71:463].
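
To make the two penalties concrete, the following sketch evaluates both criteria for a normal-error regression. It rests on assumptions not stated in the text: the error variance is replaced by its maximum likelihood estimate SSE/n, so the maximized log-likelihood is $-\tfrac{n}{2}[\log(2\pi) + \log(SSE/n) + 1]$, and p is taken as the number of regression coefficients. The function names and the numbers are hypothetical.

```python
import math

def max_log_likelihood(sse, n):
    """Maximized normal log-likelihood, with sigma^2 replaced by its MLE, SSE / n."""
    return -0.5 * n * (math.log(2.0 * math.pi) + math.log(sse / n) + 1.0)

def akaike(sse, n, p):
    """Akaike criterion in the form used above: log-likelihood minus p (larger is better)."""
    return max_log_likelihood(sse, n) - p

def schwarz(sse, n, p):
    """Schwarz criterion: log-likelihood minus (1/2) p log n (larger is better)."""
    return max_log_likelihood(sse, n) - 0.5 * p * math.log(n)

# Hypothetical fit: SSE = 10 with n = 50 observations and p = 3 parameters.
a3, s3 = akaike(10.0, 50, 3), schwarz(10.0, 50, 3)
a4, s4 = akaike(10.0, 50, 4), schwarz(10.0, 50, 4)
print(a3, s3, a4, s4)
```

For n of 8 or more, $\tfrac{1}{2}\log n$ exceeds 1, so each extra parameter costs more under the Schwarz criterion than under Akaike's, which is why the former tends to favor lower-dimensional models.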

The  selection  criteria  described  in  this  section  are  methods  for  measuring  the  relative  worth 
of  one  variable  subset  compared  to  another.  By  themselves,  however,  they  do  not  indicate  which 
variables  should  be  retained  in  a  linear  regression  model.  These  selection  criteria  in  concert  with 
the  selection  methodologies  described  next  can  be  used  to  determine  what  variables  to  retain. 

2.2.3 Selection Methodologies. Generally, a variable selection procedure involves both a
criterion (metric) to evaluate or rank variable subsets, and a methodology to select the best subsets.
Variable selection involves selecting the p best variables from among q candidate variables, where
p < q. Some of the well known methodologies for selecting the best variable subset are discussed in
this section. The well known methodologies include: explicit enumeration of all subset regressions,
best k subset regressions, and sequential search of subsets. Two other methodologies are reviewed:
ridge regression and principal components regression.

The regression of all subsets requires an explicit enumeration of all $2^q - 1$ possible subsets.
When this approach is used, one of the selection criteria from Section 2.2.2 is used to subjectively
discriminate between variable subsets. Sometimes, a small subset of the best regression models is
selected  for  further  detailed  examination.  This  method  is  the  most  computationally  intensive  of  the 
selection  methods  described.  Time-saving  algorithms,  which  evaluate  substantially  fewer  subsets, 
are  described  next. 
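
The exhaustive enumeration described above can be sketched as follows. This is an illustrative implementation under stated assumptions: SSE is used as the ranking criterion within each subset size, every model includes a bias column, and the names `all_subsets` and `sse` as well as the data are hypothetical.

```python
from itertools import combinations
import numpy as np

def sse(X, y):
    """Sum of squared errors of the least squares fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return float(r @ r)

def all_subsets(X, y):
    """Fit every non-empty subset of the q predictor columns of X (bias added here)."""
    n, q = X.shape
    ones = np.ones((n, 1))
    results = []
    for size in range(1, q + 1):
        for subset in combinations(range(q), size):
            Xs = np.hstack([ones, X[:, list(subset)]])
            results.append((subset, sse(Xs, y)))
    return results

# Hypothetical data: q = 4 candidates, only columns 0 and 2 matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.3 * rng.normal(size=60)

results = all_subsets(X, y)
best_pairs = sorted((r for r in results if len(r[0]) == 2), key=lambda r: r[1])
print(len(results), best_pairs[0])
```

With q = 4 there are only 15 subsets, but the count doubles with each added candidate, which is the motivation for the time-saving algorithms.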

Best k subsets algorithms have been developed which give the best k subsets according to a
given criterion. Furnival and Wilson propose a branch and bound algorithm which uses product
inverse matrices and a sophisticated sequence of pivots or Gaussian eliminations [18]. The algorithm
evaluates the $SSR_p$ at each step against some bound to determine the next pivot. Furnival and
Wilson's algorithm provides the k best regressions for each subset size p.

Sequential algorithms are substantially less computationally intensive; however, there is no guarantee
that these methods provide the best variable subset. Three algorithms for sequential selection
are:  forward  selection,  backward  selection,  and  stepwise  selection.  These  algorithms  discriminate 
between  feature  subsets  using  a  measure  of  the  predictive  or  explanatory  capability  of  the  regression 
function.  A  measure  of  a  regression  function’s  predictive  capabilities  for  a  specific  subset  is  given  by 
the  metrics  described  in  the  previous  section.  A  conditional  measure  of  the  additional  explanatory 
contribution of a variable or a set of variables is embodied in the F-statistic given in a partial
F-test.

The partial F-test is used to test the significance of one or more variables. The linear regression
model in Equation 3 is used with the partial F-test. It is given again for reference as

$$y = X\beta + \epsilon, \qquad \epsilon \sim N_n(0, \sigma^2 I_n)$$

where y is an n-dimensional vector of responses; X is an n × k matrix of regressor variables
consisting of k - 1 independent variables and a column of 1's for the constant term; and $\beta$ is a
k-dimensional vector of unknown variable coefficients. The partial F-test is associated with
the null hypothesis:

$$H_0: \beta_{p+1} = \cdots = \beta_k = 0,$$

where p is the number of parameters hypothesized to be in the model, and k - p is the number
of parameters hypothesized to be 0. When testing the significance of just a single variable,
k - p = 1 or p = k - 1 in $H_0$. The associated test statistic, denoted $F^*$, is defined:

$$F^* = \frac{(SSE_p - SSE_k)/(k - p)}{SSE_k/(n - k)} \tag{10}$$

Under $H_0$:

$$F^* \sim F_{k-p,\,n-k}$$

When $F^* \le F_{k-p,n-k}$, where $F_{k-p,n-k}$ here denotes the critical value of the F-distribution at the
chosen significance level, the null hypothesis cannot be rejected, which means the reduced model
with some parameters hypothesized to be zero is accepted. The variables which are not in the
reduced model correspond to the parameters which are hypothesized to be zero. Otherwise, if
$F^* > F_{k-p,n-k}$, then the null hypothesis can be rejected. This means that the full model is
accepted and no variables are removed.
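
A minimal sketch of the partial F statistic of Equation 10 follows. The data are synthetic, the function names are hypothetical, and the critical value 4.11 is an approximate tabulated 5% point for 1 and 37 degrees of freedom, used here only as an assumption for illustration.

```python
import numpy as np

def sse(X, y):
    """Sum of squared errors of the least squares fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return float(r @ r)

def partial_f(X_full, y, keep):
    """F* of Equation 10; `keep` lists the columns of X_full retained under H0."""
    n, k = X_full.shape
    p = len(keep)
    sse_k = sse(X_full, y)               # full model with k parameters
    sse_p = sse(X_full[:, keep], y)      # reduced model with p parameters
    return ((sse_p - sse_k) / (k - p)) / (sse_k / (n - k))

# Hypothetical data: x1 is relevant, x2 is noise; column 0 is the bias.
rng = np.random.default_rng(0)
n = 40
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 3.0 * x1 + 0.5 * rng.normal(size=n)
X_full = np.column_stack([np.ones(n), x1, x2])

f_drop_x1 = partial_f(X_full, y, [0, 2])   # H0: coefficient of x1 is zero
f_drop_x2 = partial_f(X_full, y, [0, 1])   # H0: coefficient of x2 is zero
F_CRIT = 4.11                               # approximate 5% point of F(1, 37)
print(f_drop_x1, f_drop_x2, f_drop_x1 > F_CRIT)
```

Dropping the relevant variable inflates the reduced model's SSE and produces a very large $F^*$, while dropping the noise variable typically yields a value near 1.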

The  sequential  selection  algorithms  presented  in  this  section  are  based  on  the  F-statistics 
computed  for  the  partial  F-test.  The  strategy  is  to  apply  successive  partial  F-tests  to  test  the 
explanatory contribution of a variable to the total sum of squares.

The forward selection procedure starts with no variables in the model and increases the
number of variables one at a time. At each step, partial F-statistics are computed. If the null
hypothesis cannot be rejected, then the algorithm terminates at that number of variables in the
model.


Forward Selection

1. p = 0

2. Compute all regression models for size k = p + 1 which include all previously selected p variables
in the model

3. Select as a candidate for entry, the new variable associated with the model having the highest
$F^*$ statistic computed using Equation 10

4. Perform the partial F-test described earlier in this section

5. If $F^* > F_{k-p,n-k}$, allow candidate variable to enter model, set p = p + 1 and go to
Step 2

Otherwise, go to Step 6

6. Stop

The  backward  selection  procedure  starts  with  all  the  variables  in  the  model  and  decreases 
the number of variables one at a time. At each step, partial F-statistics are computed. If the null
hypothesis is rejected at any step, the algorithm terminates. Advantages of a backward selection
procedure  over  a  forward  selection  procedure  are  discussed  in  Mantel  [43].  These  advantages  include 
economy  of  effort  and,  potentially,  better  variable  selection  when  correlated  variables  are  present. 
Because  of  these  advantages,  the  backwards  sequential  selection  algorithm  is  used  in  the  selection 
procedures  presented  in  Chapters  V  and  VI. 

Backward  Selection 

1. k = q, where q is the total number of candidate variables

2. p = k - 1

3. Compute all regression models for size p which do not include any previously eliminated variables

4. Select as a candidate variable for elimination, the variable which when removed produces the
model with the lowest $F^*$ statistic computed using Equation 10

5. Perform the partial F-test described earlier in this section

6. If $F^* > F_{k-p,n-k}$, go to Step 7 and do not eliminate candidate variable
Otherwise, eliminate candidate variable, set k = k - 1, and go to Step 2

7.  Stop 
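
The backward selection steps above can be sketched in code as follows. This is one possible reading, not a definitive implementation: every model is assumed to carry a bias column that is never eliminated, the critical value is supplied by a caller-provided function `f_crit` (a hypothetical name), and the constant 4.0 used in the example is only a rough stand-in for a tabulated 5% critical value.

```python
import numpy as np

def sse(X, y):
    """Sum of squared errors of the least squares fit of y on X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return float(r @ r)

def backward_selection(X, y, f_crit):
    """Return the retained predictor columns of X (n x q, no bias column).

    f_crit(df1, df2) must return the critical F value for the chosen level.
    """
    n, q = X.shape
    ones = np.ones((n, 1))
    current = list(range(q))
    while current:
        k = len(current) + 1                     # parameters in current model, incl. bias
        sse_k = sse(np.hstack([ones, X[:, current]]), y)
        f_stats = []
        for j in current:                        # try removing each variable (k - p = 1)
            reduced = [c for c in current if c != j]
            sse_p = sse(np.hstack([ones, X[:, reduced]]), y)
            f_stats.append((sse_p - sse_k) / (sse_k / (n - k)))
        i_min = int(np.argmin(f_stats))          # candidate with the lowest F*
        if f_stats[i_min] > f_crit(1, n - k):
            break                                # H0 rejected: keep candidate and stop
        current.pop(i_min)                       # H0 not rejected: eliminate candidate
    return current

# Hypothetical data: 4 candidates, only columns 0 and 2 carry signal.
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 4))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + 0.3 * rng.normal(size=100)

selected = backward_selection(X, y, f_crit=lambda df1, df2: 4.0)
print(selected)
```

The strongly relevant columns always survive because their $F^*$ values are far above the threshold; whether a particular noise column is removed depends on the sample and on the critical value supplied.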


Stepwise  selection  procedures  include  forward  and  backward  stepwise  procedures.  They  are, 
essentially,  modifications  of  the  forward  and  backward  selection  procedures.  The  forward  stepwise 
procedure tests to see if any other variables should be eliminated after each iteration of the algorithm.
The backward stepwise procedure tests to see if any other variables should be included
after  each  iteration  of  the  algorithm.  The  forward  stepwise  algorithm  is  illustrated  here.  Like  the 
forward  selection  algorithm,  the  forward  stepwise  selection  algorithm  starts  with  no  variables  in 
the  model. 


Forward Stepwise Selection

1. p = 0

2. Compute all regression models for size k = p + 1 which include all previously selected p variables
in the model

3. Select as a candidate for entry, the new variable associated with the model having the highest
$F^*$ statistic computed using Equation 10

4. Perform the partial F-test described earlier in this section

5. If $F^* > F_{k-p,n-k}$, allow candidate variable to enter model, and go to Step 6

Otherwise, go to Step 9

6. Compute all regression models of size p which include only variables that are currently in the
model

7. Select as a candidate for elimination, the variable which when removed produces the model
with the lowest $F^*$ statistic

8. If $F^* > F_{k-p,n-k}$, set p = p + 1, and do not eliminate candidate variable

Otherwise, eliminate the candidate variable and go to Step 2

9. Stop

Pope  and  Webster,  and  Krishnaiah  point  out  problems  with  the  use  of  the  F-statistic  for  the 
forward,  backward,  and  stepwise  sequential  procedures  [36,  49].  Krishnaiah  recommends  not  using 
these  procedures  for  the  selection  of  variables,  since,  in  general,  the  necessary  conditions  are  not 
met for $F^*$ to be distributed as an F distribution [36:814] [49:331-332]. Instead, Krishnaiah recommends
using the overall F test and methods based on all possible regressions or finite intersection
tests  (FIT).  The  FIT,  developed  by  Krishnaiah,  involves  using  a  multivariate  F-distribution  for 
simultaneously  testing  whether  a  set  of  variable  coefficients  are  zero  [65].  Details  for  applying  the 
FIT  for  either  univariate  or  multivariate  regression  can  be  found  in  Schmidhammer  [65]. 

Another technique which can be utilized for feature selection is ridge regression. Ridge regression
is a biased regression technique primarily designed as a remedy for multicollinearity. The
technique introduces a biasing constant, c, where c > 0, into the least squares normal equations,
Equation 5, to give

$$(X'X + cI)\,b^R = X'y$$

A simple way of choosing c is to increase c gradually and plot a ridge trace. A ridge trace is a plot of
the regression coefficients against c. Variables which are candidates for elimination are associated
with unstable ridge traces (details can be found in Neter, Wasserman, and Kutner [47:414-417]) and
coefficients close to zero.
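
A ridge trace can be tabulated with a few lines of code. The sketch below is an illustration under assumptions: the predictors are standardized and the response is centered so that no bias column is needed, the grid of c values is arbitrary, and the function name `ridge_trace` is hypothetical.

```python
import numpy as np

def ridge_trace(X, y, c_values):
    """Ridge coefficients for each c, solving (X'X + cI) b = X'y."""
    k = X.shape[1]
    XtX, Xty = X.T @ X, X.T @ y
    return np.array([np.linalg.solve(XtX + c * np.eye(k), Xty) for c in c_values])

# Hypothetical data: two nearly collinear predictors.
rng = np.random.default_rng(2)
n = 80
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)          # almost a copy of x1
X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize the predictors
y = x1 + 0.2 * rng.normal(size=n)
y = y - y.mean()                             # center the response

c_values = [0.0, 0.1, 1.0, 10.0, 100.0]
trace = ridge_trace(X, y, c_values)          # one row of coefficients per c
norms = np.linalg.norm(trace, axis=1)
for c, coef in zip(c_values, trace):
    print(c, coef)
```

The row for c = 0 reproduces ordinary least squares, and the length of the coefficient vector shrinks as c grows; coefficients that swing widely for small c are the unstable traces referred to above.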

Principal components regression is yet another technique which can be utilized for feature
selection. In principal components regression, the original variables are transformed using linear
combinations suggested by the eigenvectors of the covariance matrix. Transformed variables associated
with very small eigenvalues can be identified as candidates for elimination, since they are
not associated with a very big portion of the original variables' variance. There are two problems
associated with this method: difficulty in interpreting the transformed variable, and lack of a
quantitative stopping rule to indicate the number of transformed variables to drop.

2.3 Nonlinear Regression

There is a limit to what can be adequately approximated by a linear model, even after exploiting
transformations of the dependent or independent variables. When this is the case, a nonlinear
regression  model  is  usually  considered.  This  section  reviews  univariate  nonlinear  regression  and 
standard  statistical  selection  criteria.  Multivariate  response  extensions  to  the  univariate  case  are 
presented  in  Chapter  5  of  Gallant  [20]. 

2.3.1  Univariate  Nonlinear  Regression  Overview.  Univariate  response  nonlinear  regression 
is  analogous  to  linear  regression:  it  is  a  technique  which  describes  the  statistical  relationship  between 
a response variable and a q-dimensional set of fixed predictor variables. The predictor variables
are  treated  as  fixed  known  constants  and  not  as  random  variables  [20:2].  However,  it  is  possible 
to  consider  the  “random  predictor  variables  case”  of  nonlinear  regression  as  a  special  case  of  the 
“fixed  predictor  variables  theory  [20:247].”  The  discussion  of  the  nonlinear  regression  model  in  this 
section  is  taken  from  both  Gallant’s  [20]  and  Seber  and  Wild’s  [66]  presentations. 

Consider  the  univariate  nonlinear  regression  model 

$$y = f(X, \theta) + \epsilon, \qquad \epsilon \sim N_n(0, \sigma^2 I_n)$$

where y is an n-dimensional vector of responses, X is an n × s matrix with q columns of predictor
variables and a column of 1's for the bias term, $\theta$ is an s-dimensional vector of unknown parameters,
and $f(X, \theta)$ is the response function. The functional form hypothesized for $f(X, \theta)$ is normally
chosen based on the problem at hand or experience.

The predicted value of y, denoted $\hat{y}$, is defined

$$\hat{y} = f(X, \hat{\theta})$$

where $\hat{\theta}$ is the least squares estimate for $\theta$. The least squares estimator $\hat{\theta}$ is found by minimizing
the sum of squared errors SSE

$$SSE = [y - f(X, \theta)]'[y - f(X, \theta)]$$

with respect to $\theta$.

In neural network terms, the vector y is the n × 1 vector of observations of desired outputs
whereas the vector $\hat{y}$ is the n × 1 vector of actual outputs or trained neural network outputs.
The matrix X is the data matrix of measured feature vectors, including the bias term which
is equal to one for each vector. The vector $\theta^*$ is analogous to the vector of unknown optimal
neural network weight parameters, and the vector $\hat{\theta}$ is the vector of trained or estimated neural
network weight parameters. Also, SSE is the squared error function which is minimized in standard
backpropagation. In Chapter V, the neural network model will be posed in the framework of a
nonlinear regression statistical model.

To find the parameters which minimize SSE, the nonlinear function $f(X, \theta)$ is expanded in a
Taylor series about $\theta = \theta^*$, retaining only the linear terms as [46:427]:

$$f(X, \theta) \approx f(X, \theta^*) + F(X, \theta^*)(\theta - \theta^*),$$

where

$$F(X, \theta) = \frac{\partial}{\partial \theta'} f(X, \theta) \tag{11}$$

and $F(X, \theta^*)$ represents F evaluated at $\theta = \theta^*$. The truncated Taylor series representation is
essentially a linearization of $f(X, \theta)$ in the neighborhood of the unknown parameter $\theta^*$. With this
form of $f(X, \theta)$, several methods, such as the Gauss-Newton method, exist for computing the least
squares parameters, $\hat{\theta}$ [20:26-46].
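
As an illustration of how the linearization is used, the sketch below applies a Gauss-Newton iteration with simple step halving to a hypothetical two-parameter response function $f(x, \theta) = \theta_1 e^{\theta_2 x}$. The response function, starting values, data, and function names are all assumptions made for the example; the step-halving safeguard is a common modification and is not claimed to be the exact procedure of the cited references.

```python
import numpy as np

def f(x, theta):
    """Hypothetical response function: theta_1 * exp(theta_2 * x)."""
    return theta[0] * np.exp(theta[1] * x)

def jacobian(x, theta):
    """F(X, theta): n x s matrix of partial derivatives of f (Equation 11)."""
    e = np.exp(theta[1] * x)
    return np.column_stack([e, theta[0] * x * e])

def sse(x, y, theta):
    r = y - f(x, theta)
    return float(r @ r)

def gauss_newton(x, y, theta0, max_iter=100, tol=1e-10):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        F = jacobian(x, theta)
        r = y - f(x, theta)
        # Least squares solution of F * step = r, i.e. (F'F) step = F'r.
        step, *_ = np.linalg.lstsq(F, r, rcond=None)
        lam = 1.0
        # Halve the step until SSE does not increase.
        while sse(x, y, theta + lam * step) > sse(x, y, theta) and lam > 1e-8:
            lam *= 0.5
        theta = theta + lam * step
        if np.linalg.norm(lam * step) < tol:
            break
    return theta

# Hypothetical data generated from theta* = (2.0, -1.5) with small normal errors.
rng = np.random.default_rng(4)
x = np.linspace(0.0, 1.0, 30)
theta_star = np.array([2.0, -1.5])
y = f(x, theta_star) + 0.01 * rng.normal(size=x.size)

theta_start = np.array([1.5, -1.0])
theta_hat = gauss_newton(x, y, theta_start)
print(theta_hat, sse(x, y, theta_hat))
```

Each iteration is an ordinary linear least squares problem in the step, which is the sense in which the truncated Taylor series turns the nonlinear problem into a sequence of linear ones.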

In general, estimators for $\theta$ are not unbiased [46:426]. The properties of unbiasedness and
minimum variance are only approached asymptotically, that is, in the limit as the number of
observations n becomes large [46:426]. The distributional properties of normal-error least squares parameters have
practical importance, since the assumed distributional properties are used in hypothesis testing to
make inferences about the parameters. With the assumption of normal errors and certain regularity
conditions, $\hat{\theta}$ has approximately an s-dimensional multivariate normal distribution defined as

$$\hat{\theta} \sim N_s[\theta, s^2 (F'F)^{-1}]$$

where F is defined in Equation 11, and

$$s^2 = \frac{SSE}{n - s}$$

is an estimate of the error variance $\sigma^2$ corresponding to $\hat{\theta}$.

In practice, the matrix C is used to approximate $(F'F)^{-1}$, assuming $(F'F)^{-1}$ exists, where

$$C = [F(\hat{\theta})'F(\hat{\theta})]^{-1} \tag{12}$$

The assumption of normal errors and certain regularity conditions also gives

$$\frac{(n - s)\,s^2}{\sigma^2} \sim \chi^2_{n-s} \tag{13}$$


Gallant rigorously presents the regularity conditions used to make these distributional assumptions
[20]. The conditions are summarized below [20:19-21]:

1. The sequence of observations on predictor variables x must behave properly as n tends to
infinity. Proper behavior is obtained when:

• The predictor variable observations are chosen by random sampling.

• The predictor variable observations are chosen by disproportionate replication of a fixed
set of points.

2. The response function $f(X, \theta)$, as well as its first and second partial derivatives, must be
continuous in $(X, \theta)$.

3. The identification condition holds, which requires that
$\lim_{n \to \infty} n^{-1} \sum_{i=1}^{n} [f(x_i, \theta) - f(x_i, \theta^*)]^2$ have a unique minimum at $\theta = \theta^*$, where $x_i$ is the ith
observation of the predictor variables.

4. The rank qualification condition holds, which requires that
$\lim_{n \to \infty} n^{-1} \sum_{i=1}^{n} \left[\frac{\partial}{\partial \theta} f(x_i, \theta^*)\right]\left[\frac{\partial}{\partial \theta} f(x_i, \theta^*)\right]'$ be nonsingular.

2.3.2  Statistical  Model  Selection  Criteria.  In  this  section,  three  statistical  selection  criteria 
for  nonlinear  regression  are  reviewed:  the  Wald  test  statistic,  the  likelihood  ratio  test  statistic, 
and  the  Lagrange  multiplier  test  statistic.  Seber  and  Wild  discuss  the  asymptotic  equivalence  of 
these  statistics  [66:571-581].  All  three  criteria  are  test  statistics  which  are  formed  during  hypothesis 
testing.  Several  detailed  examples  of  applying  these  test  statistics  for  hypothesis  testing  are  given  in 
Gallant  [20:48-100].  In  this  section,  these  criteria  are  presented  and  reviewed  in  nonlinear  regression 
terminology  for  ease  of  access  into  the  nonlinear  regression  literature.  In  Chapter  V,  these  criteria 
will  be  redefined  for  use  in  the  context  of  a  neural  network  model. 

The first model selection criterion is the Wald test statistic W [20:48]. Define $h(\theta)$ as a once
differentiable function mapping from $\mathbb{R}^s$ to $\mathbb{R}^k$ with a k × s Jacobian H defined as

$$H(\theta) = \frac{\partial}{\partial \theta'} h(\theta) \tag{14}$$

where s is the number of parameters in the full model, and k is the number of parameters
hypothesized to be zero in the reduced model. Consider testing the hypothesis $H_0$ which is given
as

$$H_0: h(\theta^*) = 0.$$

The Wald test statistic is defined as

$$W = \frac{h(\hat{\theta})'(H C H')^{-1} h(\hat{\theta})}{k s^2}$$

where C and $H = H(\hat{\theta})$ were defined previously in Equations 12 and 14, $s^2$ is for the full model with
s parameters, and k is the number of parameters hypothesized to be zero. The Wald test statistic
is formed using the ratio of two independent chi-squared statistics each divided by its respective
degrees of freedom. Under $H_0$, the numerator, to within approximation error, is a chi-squared
statistic with k degrees of freedom which is associated with the null hypothesis (see Gallant for
details [20:47-48]). The denominator, to within approximation error, is a chi-squared statistic with
n - s degrees of freedom, and is associated with the full model [20:47-48]. The chi-squared statistic
in the denominator was previously shown in Equation 13.

Under $H_0$, the Wald statistic is distributed as an F-distribution with k numerator and n - s
denominator degrees of freedom. When W exceeds the α × 100% critical point of the F-distribution
with k numerator and n - s denominator degrees of freedom, the hypothesis $H_0$ is rejected. This
means that the reduced model with fewer parameters is statistically not equivalent to the full model
with all s parameters. Therefore, no parameters would be removed.

The second model selection criterion is the likelihood ratio test statistic, which is analogous to
the partial F-test discussed in the previous section. The likelihood ratio test statistic L is also used to
test $H_0: h(\theta^*) = 0$, where $h(\theta^*)$ is defined to be a subset of k parameters from the full set of s
parameters. It is defined as

$$L = \frac{(SSE_{s-k} - SSE_s)\,/\,[(n - (s - k)) - (n - s)]}{SSE_s/(n - s)}$$

where $SSE_{s-k}$ corresponds to the sum of squared errors associated with fitting the model under
the null hypothesis, and $SSE_s$ corresponds to the sum of squared errors associated with the full
model. Under $H_0$, L is distributed as an F distribution with k and n - s degrees of freedom for the
numerator and denominator, respectively. When L exceeds the α × 100% critical point of the
F-distribution with k numerator and n - s denominator degrees of freedom, the hypothesis $H_0$ is
rejected. This means that the reduced model with fewer parameters is statistically not equivalent
to the full model with all s parameters. Therefore, no parameters would be removed.
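
The arithmetic of the statistic L can be sketched directly from the two fitted sums of squared errors. The numbers below are hypothetical, the function name is an assumption, and the critical value 4.03 is an approximate tabulated 5% point for 1 and 50 degrees of freedom used only for illustration.

```python
def likelihood_ratio_stat(sse_reduced, sse_full, n, s, k):
    """L statistic: reduced model has s - k parameters, full model has s parameters."""
    df_num = (n - (s - k)) - (n - s)          # simplifies to k
    numerator = (sse_reduced - sse_full) / df_num
    denominator = sse_full / (n - s)
    return numerator / denominator

# Hypothetical fits: n = 53 observations, s = 3 parameters, k = 1 set to zero.
L = likelihood_ratio_stat(sse_reduced=12.0, sse_full=10.0, n=53, s=3, k=1)
F_CRIT = 4.03                                 # approximate 5% point of F(1, 50)
reject_h0 = L > F_CRIT                        # True means the parameter is retained
print(L, reject_h0)
```

In this example the drop of 2.0 in SSE is large relative to the full model's error variance estimate of 0.2, so the hypothesis that the parameter is zero would be rejected.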

The third model selection criterion is the Lagrange multiplier or efficient score test statistic.
There are two versions of this test statistic discussed in Gallant [20:85-97]. The Lagrange multiplier
test statistic is also used to test $H_0 : h(\theta^*) = 0$, where $h(\theta^*)$ is defined to be a
subset of k parameters from the full set of s parameters. Let SSE be minimized for the full model
subject to the null hypothesis $h(\theta^*) = 0$. This is a constrained minimization. The resulting
estimator of the constrained minimization is denoted $\tilde{\theta}$.

Now $\tilde{\theta}$ is used as a starting value and one 'unconstrained' Gauss-Newton step is taken
away from $\tilde{\theta}$, presumably towards $\hat{\theta}$, which is the least squares estimator
for the unconstrained minimization of SSE. The Gauss-Newton step, denoted D, is defined in Gallant
[20:85] as

\[ D = (F'F)^{-1} F' [y - f(\tilde{\theta})], \]

where $F = F(\tilde{\theta})$ as previously defined in Equation 11. If $H_0$ is true, then D will be
small. Conversely, if $H_0$ is false, then D will be large. The two forms of Lagrange test
statistics, $R_1$ and $R_2$, are based on a measure of D. They are defined in Gallant [20:86] as:

\[ R_1 = \frac{D'(F'F)D / k}{SSE(\hat{\theta}) / (n - s)} \]

\[ R_2 = \frac{n \, D'(F'F)D}{SSE(\tilde{\theta})} \]

where $SSE(\hat{\theta})$ represents the minimum sum of squared errors when fitting the full model
with no constraints, and $SSE(\tilde{\theta})$ represents the constrained minimum sum of squared
errors when fitting the full model subject to $H_0$. With $R_1$, the Lagrange multiplier test
rejects $H_0$ when $R_1$ exceeds $F_\alpha = F^{-1}(1 - \alpha; k, n - s)$. With $R_2$, the Lagrange
multiplier test rejects $H_0$ when $R_2$ exceeds $d_\alpha$, which is defined as

\[ d_\alpha = \frac{n F_\alpha}{(n - s)/k + F_\alpha} \]
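As a rough illustration of the two Lagrange multiplier decision rules, the sketch below takes the quadratic form D'(F'F)D as an already-computed scalar. It assumes the forms R_1 = [D'(F'F)D / k] / [SSE(unconstrained) / (n - s)], R_2 = n D'(F'F)D / SSE(constrained), and d_alpha = n F_alpha / ((n - s)/k + F_alpha); the function name, argument names, and numbers are hypothetical.

```python
def lagrange_multiplier_tests(quad_form, sse_full, sse_constrained, n, s, k, f_alpha):
    """Evaluate both Lagrange multiplier statistics.

    quad_form:       the scalar D'(F'F)D from one Gauss-Newton step
    sse_full:        unconstrained minimum SSE (needed only for R1)
    sse_constrained: minimum SSE subject to H0
    f_alpha:         upper critical point of F(k, n - s), supplied by the caller
    """
    r1 = (quad_form / k) / (sse_full / (n - s))
    r2 = n * quad_form / sse_constrained
    d_alpha = n * f_alpha / ((n - s) / k + f_alpha)
    return {
        "R1": r1,
        "reject_with_R1": r1 > f_alpha,
        "R2": r2,
        "d_alpha": d_alpha,
        "reject_with_R2": r2 > d_alpha,
    }


# Hypothetical numbers; 3.23 approximates the 5% critical point of F(2, 40).
result = lagrange_multiplier_tests(
    quad_form=2.0, sse_full=10.0, sse_constrained=12.0,
    n=50, s=10, k=2, f_alpha=3.23,
)
```

Note that only R1 needs `sse_full`, which is why R2 can be computed after a single (constrained) minimization.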

2.3.3 Summary of Statistical Model Selection Criteria. In summary, there are three model
selection hypothesis tests used in feature selection for nonlinear regression: the Wald hypothesis
test, the likelihood ratio test, and the Lagrange multiplier tests. The Wald test statistic has both
advantages and disadvantages. The advantage is that only the unconstrained (full) model must be
estimated for this criterion. The disadvantage is that the Wald test statistic can be seriously
affected by parameter effects curvatures of the error surface, which are not accounted for in the
linearized approximation used for $f(X, \theta)$ [66:200-220]. The likelihood ratio test statistic
and the Lagrange multiplier test statistics are not affected by the parameter effects curvatures
[66:200]. Between the two versions of the Lagrange multiplier test, the first test $R_1$ is always
more powerful than $R_2$, although this increase in power is sometimes negligible [20:88-89]. In
practice, $R_2$ is more commonly used than $R_1$, since $R_2$ requires just one minimization of SSE
rather than two, making $R_2$ easier to compute [20:86-87]. When comparing the likelihood ratio test
statistic and the two versions of the Lagrange multiplier test statistic, the likelihood ratio test
statistic is always the more powerful [20:89].

The last two sections reviewed feature selection topics related to linear and nonlinear regression.
The next section reviews feature selection topics related to discriminant analysis.

2.4 Discriminant Analysis


In discriminant analysis problems, the practitioner is concerned with classifying a pattern
or an object into one of K classes. A pattern or an object is represented by a vector of features
which ideally contain discriminatory information between the classes. There are many algorithms
available for doing discriminant analysis. The vast number of algorithms makes it impossible to
provide a concise overview of them in this document. Devijver and Kittler, Dillon and
Goldstein, and James are all good references on discriminant analysis and pattern recognition
algorithms [14, 15, 29].

Feature selection in these applications is similar to linear regression, since, ideally, feature
subsets would be evaluated based on a measure of the classification function's accuracy. In linear
regression, calculating this measure is very simple, but for discriminant analysis it can be
prohibitively time consuming. Therefore, feature selection is often carried out by trying to maximize
an alternative measure, one that is hopefully closely related to the error rate of
the resulting classifier [29:127]. There are several good references which review feature evaluation
criteria and feature selection methodologies in the context of pattern recognition and discriminant
analysis [8, 14, 15, 29, 74]. In the remainder of this section, standard distance metrics used for
evaluating or ranking features and feature selection methodologies are reviewed.

2.4.1 Selection Metrics. A common need for all feature selection procedures is to have an
evaluation function for measuring the saliency or potency of features. This section reviews
non-probabilistic and probabilistic feature evaluation metrics. The non-probabilistic feature
evaluation metrics are reviewed first, since they are relatively simple to calculate. After a
discussion of the non-probabilistic metrics, the more sophisticated, often computationally
burdensome, probabilistic metrics are reviewed.

Non-probabilistic feature evaluation metrics are used as an alternative to measuring the error
rate or a sophisticated probabilistic measure of the classifier's performance. Generally, these metrics
are a measure of distance between classes. The non-probabilistic feature evaluation metrics are
based on the rationale that classes should be maximally separated in the feature space. The larger
the average separation between classes, the better the feature subset. While non-probabilistic
distance metrics are somewhat unsophisticated compared to the probabilistic metrics, they are
relatively easy to calculate.

Generally, these types of metrics attempt to maximize the between-class (interclass) distances
while minimizing the within-class (intraclass) distances. An estimated matrix of the between-class
distances, $S_b$, is defined as

\[ S_b = \sum_{k=1}^{K} P(C_k) (m_k - m)(m_k - m)', \]

where $m_k$ is the M-dimensional mean vector of the kth class of M-dimensional training vectors, m is
the M-dimensional mean vector of all the training vectors, M is the number of features in a training
vector, and $P(C_k)$ is the estimated prior probability of class k, denoted $C_k$. An estimated matrix
of the within-class distances, $S_w$, is defined as

\[ S_w = \sum_{k=1}^{K} P(C_k) N_k^{-1} \sum_{t=1}^{N_k} (x_{kt} - m_k)(x_{kt} - m_k)', \]

where $x_{kt}$ is the tth training vector from the kth class and $N_k$ is the number of training
vectors in class k. Two possible metrics D(x) which are maximized to find a good subset of features are

\[ D(x) = \frac{\mathrm{tr}\, S_b}{\mathrm{tr}\, S_w} \]

and

\[ D(x) = \frac{|S|}{|S_w|}, \]

where S is equal to $S_w + S_b$. Also, $|\cdot|$ denotes calculation of a determinant. The main
criticism of these types of metrics is that they are not closely related to error probability. Two
additional drawbacks of these metrics are that their discriminatory potential depends on the classes
having equal covariance matrices, and that pattern classes with equal means may give misleading results.
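The trace-ratio metric can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the priors are assumed to be estimated from class frequencies, and each feature vector is a plain list of numbers.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def add_scaled_outer(accumulator, u, scale):
    """accumulator += scale * (u u') for a square list-of-lists matrix."""
    for i, a in enumerate(u):
        for j, b in enumerate(u):
            accumulator[i][j] += scale * a * b


def scatter_matrices(classes):
    """Return (S_b, S_w) for classes given as {label: [feature vectors]}."""
    total = sum(len(vectors) for vectors in classes.values())
    everything = [x for vectors in classes.values() for x in vectors]
    dim = len(everything[0])
    m = mean_vector(everything)
    s_b = [[0.0] * dim for _ in range(dim)]
    s_w = [[0.0] * dim for _ in range(dim)]
    for vectors in classes.values():
        prior = len(vectors) / total          # assumed: frequency-based prior
        m_k = mean_vector(vectors)
        add_scaled_outer(s_b, [a - b for a, b in zip(m_k, m)], prior)
        for x in vectors:
            deviation = [a - b for a, b in zip(x, m_k)]
            add_scaled_outer(s_w, deviation, prior / len(vectors))
    return s_b, s_w


def trace_ratio(classes):
    """D(x) = tr(S_b) / tr(S_w); larger values mean better class separation."""
    s_b, s_w = scatter_matrices(classes)
    tr_b = sum(s_b[i][i] for i in range(len(s_b)))
    tr_w = sum(s_w[i][i] for i in range(len(s_w)))
    return tr_b / tr_w                        # undefined if all classes are points


far = {"A": [[0.0, 0.0], [2.0, 0.0]], "B": [[10.0, 0.0], [12.0, 0.0]]}
near = {"A": [[0.0, 0.0], [2.0, 0.0]], "B": [[4.0, 0.0], [6.0, 0.0]]}
```

With these toy data the widely separated classes in `far` score higher than those in `near`, as the rationale of maximal class separation suggests.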

A general nonlinear metric D(x) which reflects the local probability structure of the data can
be maximized to find a good subset of features [14:242]

\[ D(x) = [\text{leading terms illegible}] \sum_{t=1}^{N_k} \sum_{j=1}^{N_l} d(x_{kt}, x_{lj}) \]

where $d(x_{kt}, x_{lj})$ is the nonlinear distance metric between the tth vector in class k and the
jth vector in class l. This nonlinear distance is equal to a constant if the Euclidean distance
between the two vectors is above a threshold T; otherwise it is equal to zero. The threshold T
represents a safe or effective distance for correctly classifying the two points into separate
classes. Methods for determining T are discussed in Devijver and Kittler [8:197-198,242-245]. This
nonlinear metric represents a compromise between probabilistic and non-probabilistic feature
evaluation metrics.

The following survey of probabilistic distance metrics is primarily based on the discussion in
Ben-Basset and Devijver and Kittler [8, 14]. Ben-Basset groups probabilistic feature evaluation
metrics into three categories [8:778]:


1. Distance metrics derived from information measures.

2.  Distance  metrics  derived  from  distance  measures. 

3.  Distance  metrics  derived  from  dependence  measures. 


The  probabilistic  feature  evaluation  metrics  are  reviewed  using  this  taxonomy. 

The first category of probabilistic feature evaluation metrics consists of the information metrics.
Probabilistic information measures are sometimes referred to as uncertainty measures. Given an
uncertainty function u and a prior probability vector $\pi$, the information gain, I(x), can be
defined in general terms as

\[ I(x) = u(\pi) - E_x[u(\pi(x))] \]

where $u(\pi)$ is the prior uncertainty and $E_x[u(\pi(x))]$ is the posterior uncertainty using the
vector of posterior probabilities $\pi(x)$. The prior uncertainty, $u(\pi)$, is independent of x, so
an information metric based on x can be reduced to

\[ U(x) = E_x[u(\pi(x))] \]

where we want to find the set of features x which minimizes the uncertainty U(x).

Uncertainty metrics differ by their definition of $u(\pi(x))$. Define the posterior probability
of class k as $P(C_k|x)$, which is used in Table 1 to define several different uncertainty metrics
[8:779-780]. All of the uncertainty metrics are bounded above and below by some function of $P_e$,
also defined in Table 1 (see Ben-Basset [8:780]). Also, the Bayes, Shannon, Quadratic, and Daroczy
uncertainty functions all belong to the f-entropy family of uncertainty metrics defined in Table 1.

Table 1. Distance Metrics Based on Probabilistic Uncertainty

Uncertainty Function    Definition
--------------------    ---------------------------------------------------------------
Bayes $P_e$             $u(\pi(x)) = 1 - \max\{P(C_1|x), \ldots, P(C_K|x)\}$
Shannon H               $u(\pi(x)) = -\sum_{k=1}^{K} P(C_k|x) \log P(C_k|x)$
Quadratic Q             $u(\pi(x)) = \sum_{k=1}^{K} P(C_k|x)(1 - P(C_k|x))$
Daroczy $Q_\alpha$      [illegible]
f-entropy               $u(\pi(x)) = \sum_{k=1}^{K} f(P(C_k|x))$
Renyi                   [illegible]

This table adapted from [8:780].
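To make the uncertainty functions concrete, the following sketch evaluates the Bayes, Shannon, and Quadratic forms for a vector of posterior probabilities and averages them over a sample to approximate U(x). The function names are hypothetical, and replacing the expectation over x with a sample mean is an assumption made for illustration.

```python
import math


def bayes_uncertainty(posterior):
    """u = 1 - max_k P(C_k | x)."""
    return 1.0 - max(posterior)


def shannon_uncertainty(posterior):
    """u = -sum_k P(C_k | x) log P(C_k | x), with 0 log 0 taken as 0."""
    return -sum(p * math.log(p) for p in posterior if p > 0.0)


def quadratic_uncertainty(posterior):
    """u = sum_k P(C_k | x) (1 - P(C_k | x))."""
    return sum(p * (1.0 - p) for p in posterior)


def expected_uncertainty(posteriors, uncertainty):
    """Sample-mean approximation of U(x) = E_x[u(pi(x))].

    posteriors: one posterior probability vector per observed x.
    """
    return sum(uncertainty(p) for p in posteriors) / len(posteriors)


# A feature subset that yields confident posteriors has lower uncertainty.
confident = [[0.95, 0.05], [0.10, 0.90]]
vague = [[0.55, 0.45], [0.50, 0.50]]
```

All three functions are zero for a certain posterior such as [1, 0] and largest for a uniform posterior, which is the behavior the information metrics rely on.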

The Bayesian probability of error, $P_e$, is probably the most commonly used feature selection
metric. It falls into the category of probabilistic uncertainty metrics. The $P_e$ feature evaluation
metric is commonly used whenever the goal is minimizing classifier error rate with features of equal
measurement cost. Let the posterior probability of class k for x, $P(C_k|x)$, be defined as

\[ P(C_k|x) = \frac{P(C_k) P(x|C_k)}{\sum_{l=1}^{K} P(C_l) P(x|C_l)} \]

where $P(C_k)$ is the prior probability of class k and $P(x|C_k)$ is the conditional probability
function of x for class k. Now, the metric can be given as [8:774]

\[ P_e(x) = E_x[1 - \max\{P(C_1|x), \ldots, P(C_K|x)\}] \qquad (16) \]

where x is the vector of features for which $P_e$ is measured.
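For a single discrete-valued feature the expectation in Equation 16 becomes a finite sum, which the sketch below evaluates directly. The function name and the example distributions are hypothetical; continuous features would need integration or sampling instead.

```python
def bayes_error(priors, conditionals):
    """P_e for one discrete feature.

    priors:       [P(C_1), ..., P(C_K)]
    conditionals: conditionals[k][v] = P(x = v | C_k)
    """
    n_classes = len(priors)
    n_values = len(conditionals[0])
    error = 0.0
    for v in range(n_values):
        joint = [priors[k] * conditionals[k][v] for k in range(n_classes)]
        p_x = sum(joint)                 # P(x = v)
        # P(x) * (1 - max_k P(C_k | x)) simplifies to P(x) - max_k joint_k.
        error += p_x - max(joint)
    return error


priors = [0.5, 0.5]
informative = [[0.9, 0.1], [0.2, 0.8]]     # class-dependent feature
uninformative = [[0.5, 0.5], [0.5, 0.5]]   # identical for both classes
```

An informative feature drives the error down, while a feature whose distribution is the same in every class leaves it at the level set by the priors alone.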

Ben-Basset discusses several reasons why alternative feature evaluation metrics may be preferred
to $P_e$ [8:776-777]. First, $P_e$ may not be sensitive enough to discriminate between good
and better features, because it is based on the most probable class, which can be strongly influenced
by the prior probabilities $P(C_k)$. Second, sequential selection of features using $P_e$ does not
ensure good performance on resulting subsets, even for conditionally independent features [8, 11, 16].
Third, other feature evaluation criteria may perform better when the objective of a myopic
sequential selection procedure is to reach a predetermined level of $P_e$ with a minimum number of
features. Lastly, computation of $P_e$ may be costly, since it involves integration of the function
$\max\{P(C_1|x), \ldots, P(C_K|x)\}$. Of the four reasons just discussed, Ben-Basset concludes that
the most significant reason for avoiding $P_e$ is that it may not be sensitive enough to discriminate
between good and better features [8:788]. Even if alternate metrics offer no computational advantage
over $P_e$, they should be considered when $P_e$ becomes highly insensitive due to a relatively high
prior probability for one class [8].

The second category of probabilistic feature evaluation metrics consists of the distance metrics.
Probabilistic distance metrics are based on distances between probability measures. Measures of
probability are prior probability density functions, posterior probability density functions, and
conditional density functions. These metrics are also known as discriminatory, separability, or
divergence measures. Probabilistic distance metrics can be divided into two classes:

1. Those based on the distance between prior probability density functions and posterior
probability density functions of each class.

2. Those based on distances between conditional density functions.

Probabilistic distance functions of the first type are based on the rationale that a feature
subset which changes the assessment of the true class probability is a good feature subset. The
more drastic the assessed change, the better the feature subset. This type of metric is similar to
the probabilistic information metrics except that distance functions are used instead of uncertainty
functions. Table 2 presents several different distance measures based on distance between prior
and posterior probability density functions for each class [8:783]. For these distance measures, the
probability of x, denoted P(x), is defined as

\[ P(x) = \sum_{k=1}^{K} P(C_k) P(x|C_k) \]

where $P(C_k)$ is the prior probability of class k, and $P(x|C_k)$ is the class-conditional
probability density function of x for class k.

Probabilistic distance functions of the second type are based on the rationale that the larger
the distance between class-conditional density functions $P(x|C_k)$, the easier it will be to
discriminate between classes. For a two-class problem, this type of distance function is minimized
when $P(x|C_1) = P(x|C_2)$, and is maximized when $P(x|C_1)$ and $P(x|C_2)$ are orthogonal [8:781].
For a multi-class problem, the distance function can be generalized to be the weighted sum of
distances between class-conditioned density functions for all pairs of classes [8:781] [14:261]. The
distance function for a multi-class problem measures the discriminatory power of the evaluated
feature vector over all K classes as

\[ D(x) = \sum_{k=1}^{K} \sum_{l=1}^{K} P(C_k) P(C_l) d_{kl}(x) \]

where $d_{kl}(x)$ is the distance between the class-conditioned density functions of class k and
class l. A disadvantage of this metric is that one large $d_{kl}$ may dominate D(x) by imposing a
ranking which is biased by the two most separable classes [8:782]. Table 3 presents several versions
of $d_{kl}$ adapted from Ben-Basset [8:784] and Devijver and Kittler [14:257-258]. Functions of the
distance functions in this table serve as lower and upper bounds for $P_e$ [8:784]. Another
multi-class distance measure
between class-conditional density functions is Matusita's extension of the affinity distance measure,
where D(x) is defined as [8:782]

\[ D(x) = \int \left[ \prod_{k=1}^{K} P(x|C_k) \right]^{1/K} dx \]
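The weighted pairwise sum D(x) over all class pairs can be sketched for discrete class-conditional distributions, using the Bhattacharyya distance as the pairwise d_kl. The function names are hypothetical, and replacing the integral with a sum over discrete feature values is an assumption made to keep the example self-contained.

```python
import math


def bhattacharyya(p, q):
    """-log sum_v sqrt(p_v q_v) for two discrete distributions."""
    coefficient = sum(math.sqrt(a * b) for a, b in zip(p, q))
    if coefficient == 0.0:
        return math.inf              # orthogonal distributions
    return -math.log(coefficient)


def multiclass_distance(priors, conditionals, pair_distance=bhattacharyya):
    """D(x) = sum_k sum_l P(C_k) P(C_l) d_kl(x)."""
    n_classes = len(priors)
    return sum(
        priors[k] * priors[l] * pair_distance(conditionals[k], conditionals[l])
        for k in range(n_classes)
        for l in range(n_classes)
    )


priors = [0.5, 0.5]
separated = [[0.9, 0.1], [0.1, 0.9]]
identical = [[0.3, 0.7], [0.3, 0.7]]
```

Identical class-conditional distributions give a distance of essentially zero, while well-separated ones give a positive value, matching the minimum and maximum behavior described for the two-class case.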

The third category of probabilistic feature evaluation metrics consists of the dependence metrics.
Probabilistic dependence metrics can also be referred to as correlation metrics. They are natural
multi-class feature selection metrics related to both probabilistic information measures and
probabilistic distance measures. The probabilistic dependence metrics are based on the rationale that
correlation is important between an evaluated feature vector and its true class. The larger the
dependence, denoted R(x), the better the feature vector x. Table 4 presents several versions of
R(x) adapted from Ben-Basset [8:785] and Devijver and Kittler [14:261].


Table 2. Distance Metrics Based on Prior and Posterior pdfs

Distance Function                  Definition
-------------------------------    ---------------------------------------------------------------
Affinity                           [illegible]
Bayesian                           $\int [\sum_{k=1}^{K} P(C_k|x)^2] P(x) dx$
Directed Divergence                $\int [\sum_{k=1}^{K} P(C_k|x) \log \frac{P(C_k|x)}{P(C_k)}] P(x) dx$
Divergence of Order $\alpha > 0$   $\int [\log \sum_{k=1}^{K} P(C_k|x)^{\alpha} P(C_k)^{(1-\alpha)}] P(x) dx$
Variance                           $\int [\sum_{k=1}^{K} P(C_k)(P(C_k|x) - P(C_k))^2] P(x) dx$

This table adapted from [8:783].


Table 3. Distance Metrics Based on Class-Conditioned Density Functions

Distance Function                  Definition $d_{kl}$
-------------------------------    ---------------------------------------------------------------
Chernoff                           $-\log \int P(x|C_k)^{\alpha} P(x|C_l)^{(1-\alpha)} dx$
Bhattacharyya                      $-\log \int (P(x|C_k) P(x|C_l))^{1/2} dx$
  (Chernoff where $\alpha = .5$)
Matusita                           [illegible]
Kullback-Leibler                   $\int [P(x|C_k) - P(x|C_l)] \log \frac{P(x|C_k)}{P(x|C_l)} dx$
Patrick-Fisher                     $\{\int [P(C_k)P(x|C_k) - P(C_l)P(x|C_l)]^2 dx\}^{1/2}$
Lissack-Fu                         $\int |P(C_k)P(x|C_k) - P(C_l)P(x|C_l)|^{\alpha} P(x)^{(1-\alpha)} dx$
Kolmogorov                         $\int |P(C_k)P(x|C_k) - P(C_l)P(x|C_l)| dx$

This table adapted from [8:784] and [14:257-258].


Table 4. Distance Metrics Based on Probabilistic Dependence

Distance Function                  Definition R(x)
-------------------------------    ---------------------------------------------------------------
Chernoff                           $\sum_{k=1}^{K} P(C_k) \{-\log \int P(x|C_k)^{\alpha} P(x)^{(1-\alpha)} dx\}$
Bhattacharyya                      [illegible]
  (Chernoff where $\alpha = .5$)
Matusita                           $\sum_{k=1}^{K} P(C_k) \{\int [\sqrt{P(x|C_k)} - \sqrt{P(x)}]^2 dx\}^{1/2}$
Joshi (Kullback-Leibler)           $\sum_{k=1}^{K} P(C_k) \int [P(x|C_k) - P(x)] \log \frac{P(x|C_k)}{P(x)} dx$
Patrick-Fisher                     $\sum_{k=1}^{K} P(C_k) \{\int [P(x|C_k) - P(x)]^2 dx\}^{1/2}$
Lissack-Fu                         $\sum_{k=1}^{K} P(C_k) \int |P(x|C_k) - P(x)|^{\alpha} P(x)^{(1-\alpha)} dx$
Kolmogorov                         [illegible]

This table adapted from [8:785] and [14:261].


2.4.2 Selection Methodologies. Selection methodologies used for classification analysis are
essentially the same as those used in linear regression. Exhaustive search of all $2^q - 1$ feature
subsets for q candidate features is usually impractical for moderately large values of q. Even
choosing the best subset of size k by exhaustive enumeration is usually impractical, since it
involves evaluation of $q!/((q-k)!\,k!)$ subsets. There are a number of optimal and suboptimal search
algorithms which are designed to circumvent an exhaustive search procedure. Most search algorithms
look for the best features by adding and/or removing features from the current feature set. A forward
procedure starts with no features and searches for the features in a "bottom up" manner, whereas a
backward procedure begins with all the candidate features and searches in a "top down" manner. A
feature evaluation metric is evaluated at each step of these algorithms to determine which features
to add or remove. For some of the probabilistic distance metrics displayed in Table 3, computational
savings can be realized by exploiting recursive evaluation of the metrics during "bottom up" and
"top down" searches [14:265-269]. Genetic or evolutionary algorithms, which will not be discussed,
are alternative methods.
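The size of the exhaustive search is easy to quantify. The short sketch below counts the candidate subsets for a given number of features; the function names are hypothetical.

```python
import math


def count_all_subsets(q):
    """Number of non-empty feature subsets of q candidate features: 2^q - 1."""
    return 2 ** q - 1


def count_subsets_of_size(q, k):
    """Number of subsets of exactly k features: q! / ((q - k)! k!)."""
    return math.comb(q, k)


all_subsets_20 = count_all_subsets(20)
size_10_of_20 = count_subsets_of_size(20, 10)
```

Even with only 20 candidate features there are over a million non-empty subsets, and each one would require its own evaluation of the chosen metric.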

Branch and bound is an optimal search algorithm for finding the best subset of size k. It is
basically a "top down" search which avoids exhaustive search by using the monotonicity property
that applies to most feature evaluation metrics [14:207]. The monotonicity property dictates that
the error from a set of features is never greater than the error from a subset of those features.
This principle can be translated to mean that more features implies more information, which in turn
implies lower error. Monotonicity may allow many feature subsets to be inspected implicitly with
no additional computation. Devijver and Kittler provide a detailed presentation of the
branch-and-bound algorithm [14:207-214]. When the subset size k is approximately q/2, the
branch-and-bound algorithm may yield substantial savings in computation costs compared to exhaustive
enumeration; however, the savings are not as impressive for very small k or k close to q [14:214]. In
many cases, the monotonicity assumption used by the branch-and-bound algorithm may be invalid
if the statistical structure of the prior class probabilities is not known [74:799]. Van Campenhout
discusses the Hughes paradox, which documents the phenomenon of error being minimized at some
finite size feature set [74:796-780].

For many problems, the optimal branch-and-bound algorithm is still computationally impractical.
Suboptimal search algorithms which trade off optimality for computational feasibility can
be used for these problems. With these algorithms, there is no guarantee that the best feature
set will be found. Some suboptimal searches are more sophisticated than others in two respects:
the number of computations required, and the number of feature subsets evaluated. There is also
no guarantee that more sophisticated algorithms will yield better subsets than less sophisticated
algorithms. The next paragraphs summarize several suboptimal search algorithms.

The  least  sophisticated  suboptimal  search  algorithm  is  to  choose  a  feature  set  of  size  k  based 
on  the  k  best  features  when  measured  independently  with  one  of  the  feature  evaluation  metrics 
discussed  previously  in  this  chapter.  Even  if  all  the  features  are  statistically  independent,  this 
method  does  not  guarantee  optimality  [8,  11, 14, 16]. 

Sequential forward and backward selection algorithms are similar to those discussed for linear
regression in Section 2.2.3. These algorithms allow just one feature to be added or taken away
at a time. The main drawback of forward and backward sequential selection algorithms is that
they do not allow a feature to be removed or added at a later point in the algorithm. Another
drawback is that the total number of features to be selected, k, must be known up front, since k
serves as a "stopping" mechanism. The algorithms stop when the current number of features in
the model, p, equals the number to be selected, k. The sequential selection algorithms which follow
are summarized from Devijver and Kittler's presentation [14:216-217].

Forward Sequential Selection


1. p = 0

Set the total number of features to select equal to k

Select any of the feature evaluation metrics described in Section 2.4.1, say D(x)

2. Compute D(x) for all feature subsets of size p + 1 which include all previously selected p
features

3. Select the feature set of size p + 1 which maximizes (minimizes) D(x)

4. Set p = p + 1

5. If p < k go to step 2
Otherwise, go to step 6

6. Stop, since k features have now been selected


Backward  Sequential  Selection 

1. p = q

Set the total number of features to select equal to k

Select any of the feature evaluation metrics described in Section 2.4.1, say D(x)

2. Compute D(x) for all feature subsets of size p - 1 which do not include any previously
eliminated features

3. Select the feature set of size p - 1 which maximizes (minimizes) D(x)

4. Set p = p - 1

5. If p > k go to step 2
Otherwise, go to step 6

6. Stop, since k features have now been selected
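The two sequential procedures above can be sketched as follows. This is a minimal illustration under stated assumptions: the metric is passed in as a function of a feature list and is treated as larger-is-better, and `toy_metric` with its weights is a hypothetical stand-in for any D(x) from Section 2.4.1.

```python
def sequential_forward(features, k, metric):
    """Grow the selected set one feature at a time until it has k features."""
    if k > len(features):
        raise ValueError("k cannot exceed the number of candidate features")
    selected = []
    remaining = list(features)
    while len(selected) < k:
        # Steps 2-3: evaluate every size p + 1 superset, keep the best one.
        best = max(remaining, key=lambda f: metric(selected + [f]))
        selected.append(best)          # Step 4: p = p + 1
        remaining.remove(best)
    return selected


def sequential_backward(features, k, metric):
    """Shrink the full set one feature at a time until k features remain."""
    selected = list(features)
    while len(selected) > k:
        # Steps 2-3: evaluate every size p - 1 subset, keep the best one.
        dropped = max(
            selected,
            key=lambda f: metric([g for g in selected if g != f]),
        )
        selected.remove(dropped)       # Step 4: p = p - 1
    return selected


def toy_metric(subset):
    """Hypothetical additive metric used only to exercise the searches."""
    weights = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.5}
    return sum(weights[f] for f in subset)


forward_choice = sequential_forward(["a", "b", "c", "d"], 2, toy_metric)
backward_choice = sequential_backward(["a", "b", "c", "d"], 2, toy_metric)
```

With an additive metric both searches agree; with metrics that capture interactions between features they can end at different subsets, and neither is guaranteed to be optimal.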


Generalized sequential forward or backward selection allows more than one feature to be added
or taken away during each iteration [14:217-219]. By taking more than one measurement into
consideration, the statistical relationship among potential features is partially taken into
consideration. These algorithms are computationally more sophisticated than sequential algorithms,
since more feature subsets must be evaluated at each step. These algorithms differ from the
non-generalized version of the algorithms primarily in Step 2. Now all feature subsets of size
p + r or p - r are evaluated, where r is the number of additional features added or taken away from
the feature set. The generalized forward or backward algorithm still does not allow features to be
taken away or added later in the algorithm.

Stepwise  selection  algorithms  are  sophisticated  algorithms  which  allow  one  feature  to  be 
added  and  taken  away  during  each  iteration  of  the  algorithm.  These  algorithms  partially  take  into 
consideration  the  statistical  relationship  between  feature  subsets.  Stepwise  selection  algorithms 
require  more  computation  than  strictly  sequential  algorithms.  The  forward  sequential  algorithm 
can  be  adapted  to  perform  forward  stepwise  selection  as  follows. 

Forward  Stepwise  Selection 

1. p = 0

Set the total number of features to select equal to k

Select any of the feature evaluation metrics described in Section 2.4.1, say D(x)

2. Compute D(x) for all feature subsets of size p + 1 which include all previously selected p
features

3. Select the feature set of size p + 1 which maximizes (minimizes) D(x)

4. Compute D(x) for all feature subsets of size p which are drawn from the p + 1 features
selected in step 3

5. If the feature subset of size p which maximizes D(x) is not the same as the subset of size p
used in step 2, then retain only those p features and go to step 2

Otherwise, go to step 6

6. Set p = p + 1

7. If p < k go to step 2
Otherwise, go to step 8

8. Stop, since k features have now been selected
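The forward stepwise procedure can be sketched as below. Two points are assumptions added for the illustration rather than part of the listing above: the metric is treated as larger-is-better, and a swap in step 5 is accepted only when it strictly improves the metric, which guarantees that the loop terminates. The function name is hypothetical.

```python
def forward_stepwise(features, k, metric):
    """Forward selection with a drop-and-recheck step after each addition.

    metric takes a frozenset of features and returns a score to maximize.
    """
    if k > len(features):
        raise ValueError("k cannot exceed the number of candidate features")
    selected = frozenset()
    while len(selected) < k:
        # Steps 2-3: best superset of size p + 1.
        candidates = [selected | {f} for f in features if f not in selected]
        grown = max(candidates, key=metric)
        if selected:
            # Step 4: best size-p subset of the p + 1 chosen features.
            shrunk = max((grown - {f} for f in sorted(grown)), key=metric)
            # Step 5: swap only on strict improvement (assumption, see lead-in).
            if shrunk != selected and metric(shrunk) > metric(selected):
                selected = shrunk
                continue               # back to step 2 with the new p features
        selected = grown               # Step 6: p = p + 1
    return selected
```

In a case where the best pair of features does not contain the best single feature, plain forward selection stays locked into its first choice, whereas this procedure can exchange it once a better pair becomes visible.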

Stepwise selection algorithms can also be generalized to allow for more than one feature to be
added or taken away (or possibly both added and taken away) at each iteration of the algorithm
[15:220-223]. Further generalization allows the number of features added or taken away to vary at
each iteration within the algorithm.

The  last  three  sections  have  reviewed  feature  selection  topics  related  to  classical  linear  and 
nonlinear  regression  and  discriminant  analysis.  The  next  section  reviews  the  work  which  has  been 
done  for  feature  selection  specifically  in  the  context  of  neural  networks. 

2.5  Feedforward  Neural  Networks 

In  this  section,  techniques  designed  to  evaluate  and  select  features  within  a  neural  network 
framework  are  reviewed.  Notational  conventions,  network  structure,  and  the  back  propagation 
learning  algorithm  are  reviewed  in  Chapter  I  for  neural  networks. 

2.5.1  Feature  Saliency  Metrics.  In  this  section,  the  established  feedforward  neural  network 
feature  metrics  are  described.  The  single  hidden  layer  neural  network  is  displayed  again  for  reference 
in  Figure  3. 

The probability of error measure, $P_e(x)$, introduced in Section 2.4, Equation 16, is often
used as a benchmark for independently evaluating the usefulness of neural network features for
discriminant analysis problems [51, 59, 60]. $P_e(x)$ is defined again here for reference

\[ P_e(x) = E_x[1 - \max\{P(C_1|x), \ldots, P(C_K|x)\}], \]

where $E_x[\cdot]$ is the expectation function with respect to x, $\max\{\cdot\}$ is the maximization
function, and $P(C_k|x)$ is the posterior probability of class k. The probability of error $P_e(x)$
is approximated with a feedforward neural network when appropriate conditions are met (discussed in
Chapter I Section 3). Therefore, the net's approximation to $P_e(x)$ can also be used to
independently evaluate the usefulness of neural network features.

Figure 3. Single Hidden Layer Feedforward Neural Network

The network's approximation can be defined as

\[ \hat{P}_e(x, w) = P^{-1} \sum_{p=1}^{P} [1 - \max\{z_1(x^p, w), \ldots, z_K(x^p, w)\}], \]

where P is the number of data exemplars x in the data set and $z_k(x, w)$ is the network output for
the kth class. The network approximation to probability of error is computed for the feature vector
x and the estimated network weight parameters w. Alternatively, $\hat{P}_e(x, w)$ can be used to
evaluate a subset of features with the network. In this case, the correlations between the specific
feature inputs considered are taken into account.
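Given the outputs of an already trained network, the approximation reduces to an average over the exemplars. The sketch below assumes the outputs are supplied as one list of K values per exemplar; the function name and the sample outputs are hypothetical, and no network training is shown.

```python
def network_error_estimate(outputs):
    """Average of 1 - max_k z_k over all exemplars.

    outputs: one list of K network outputs per data exemplar, each output
             assumed to approximate a posterior class probability.
    """
    if not outputs:
        raise ValueError("at least one exemplar is required")
    return sum(1.0 - max(z) for z in outputs) / len(outputs)


# Two exemplars, two classes: one confident output and one less confident.
sample_outputs = [[0.9, 0.1], [0.6, 0.4]]
estimate = network_error_estimate(sample_outputs)
```

Retraining the network on a candidate feature subset and recomputing this estimate gives one way to compare subsets under the same measure.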

Le Cun and others propose a saliency metric for evaluating features or middle nodes which
is based on second derivative information [39]. They construct a local model of the network error
function using a Taylor series approximation of the network's error (see Appendix A for details).
Then, the network's change in error due to weight changes is approximated. In order to make it
computationally practical to evaluate the Taylor series expression for the change in error, three
simplifying assumptions are used:

1.  Assume  the  Taylor  series  is  evaluated  at  a  local  minimum  of  the  error  which  makes  the  first 
order  terms  equal  to  zero. 

2.  A  diagonalizing  assumption  is  used  to  eliminate  all  cross  terms  of  the  Hessian  matrix. 

3.  A  quadratic  approximation  assumption  about  the  error  surface  is  used  which  implies  that  3rd 
order  and  higher  order  terms  are  negligible. 

All that remains are the second order diagonal terms, which are assumed positive at a local
minimum. Therefore, a perturbation of the vector of estimated weight parameters
$\mathbf{w}_i = \{w^1_{i,1}, \ldots, w^1_{i,J}\}$ associated with feature $i$ should cause the error to either increase or stay the
same. For a trained network, the saliency metric for feature node $i$ developed by Le Cun and
others uses the second derivative of network output error with respect to the vector of associated
feature input weights. The network output error is defined as the squared output error $E_o$, summed
over the exemplars $p = 1, \ldots, P$ and the outputs $k = 1, \ldots, K$.

The second order saliency metric for feature $i$, denoted $s_i$, is defined as

$$s_i = h_{ii} \cdots$$

where $\mathbf{x}^p$ denotes the $p$th input exemplar, and

$$h_{ii} = \frac{\partial^2 (E_o)}{\partial \mathbf{w}_i^2}$$

As defined above, $s_i$ corresponds to the saliency of feature $i$ for a single input exemplar. When the
metric $s_i$ is averaged over the entire training set, the result, denoted $S_i$, is given as

$$S_i = P^{-1} \sum_{p=1}^{P} s_i(\mathbf{x}^p)$$

where $P$ is the number of training vectors, and $s_i(\mathbf{x}^p)$ indicates that $s_i$ is a function of the $p$th
input vector $\mathbf{x}^p$. A more detailed derivation of $s_i$ and $S_i$ is shown in Appendix A, Section A.1.

Another technique for measuring the saliency or relevance of a neural network unit, such
as a feature, is proposed by Mozer and Smolensky [45]. Mozer and Smolensky propose that the
relevance of a feature be measured as a function, $\rho_i$, of how well the network performs with the
unit versus how well the network performs without the unit, i.e.

$$\rho_i = E_{\text{without unit } i} - E_{\text{with unit } i}.$$

They propose approximating $\rho_i$ by examining the derivative of the error with respect to a relevance
factor coefficient, $\alpha_i$. The relevance factor coefficient, $\alpha_i$, represents the attentional strength of a
unit.

This coefficient can be thought of as gating the flow of activity from the unit:
$o_j = f(\sum_i w_{ji} \alpha_i o_i)$, where $o_j$ is the activity of unit $j$, $w_{ji}$ the connection strength
to $j$ from $i$, and $f$ the sigmoid squashing function. If $\alpha_i = 0$, unit $i$ has no influence
on the rest of the network; if $\alpha_i = 1$, unit $i$ is a conventional unit. [45:109]

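In terms of $\alpha_i$, $\rho_i$ can be defined as

$$\rho_i = E_{\alpha_i = 0} - E_{\alpha_i = 1}$$

and Mozer and Smolensky approximate $\rho_i$ with:

$$\hat{\rho}_i = -\left. \frac{\partial E}{\partial \alpha_i} \right|_{\alpha_i = 1}$$

The derivation and notational details of $\rho_i$ are shown in Appendix A, Section A.2.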
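The relationship between the exact relevance and its derivative approximation can be sketched for a single gated sigmoid unit. This is only an illustration: the weights, inputs, and target are made up, the error is taken to be a squared error, and the approximation is assumed to be the negative derivative of the error with respect to $\alpha_i$ evaluated at $\alpha_i = 1$, estimated here by a central difference rather than by backpropagation.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def error(alpha, weights, inputs, target):
    """Squared error of one sigmoid unit whose inputs are gated by alpha."""
    net = sum(w * a * o for w, a, o in zip(weights, alpha, inputs))
    return (target - sigmoid(net)) ** 2


def relevance_exact(i, weights, inputs, target):
    """rho_i = E(alpha_i = 0) - E(alpha_i = 1), all other alphas held at 1."""
    on = [1.0] * len(weights)
    off = list(on)
    off[i] = 0.0
    return error(off, weights, inputs, target) - error(on, weights, inputs, target)


def relevance_approx(i, weights, inputs, target, h=1e-5):
    """-dE/d(alpha_i) at alpha_i = 1, estimated by a central difference."""
    up = [1.0] * len(weights)
    down = list(up)
    up[i] += h
    down[i] -= h
    slope = (error(up, weights, inputs, target)
             - error(down, weights, inputs, target)) / (2 * h)
    return -slope


# Hypothetical unit with one strongly and one weakly connected input.
weights = [2.0, 0.05]
inputs = [1.0, 1.0]
target = 1.0
exact = [relevance_exact(i, weights, inputs, target) for i in range(2)]
approx = [relevance_approx(i, weights, inputs, target) for i in range(2)]
```

In this example both measures rank the strongly connected input above the weakly connected one, although the derivative is only a local approximation and can differ noticeably in magnitude from the exact difference.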

Ruck describes a feature saliency metric which measures feature $i$’s effect on a neural network’s
output [59]. The metric attempts to capture the total of the partial derivatives of the network’s
outputs with respect to the entire $M$-dimensional feature space $\Re^M$. Ruck’s saliency metric for
feature $i$ is built from the exact partial derivatives of network outputs, $z_k$, with respect to feature
inputs $x_i$ using a trained network.

Ideally, the input space would be systematically sampled over its entire range of values [59:34].
If $R$ points were used for each input, the total number of derivatives would be on the order of $R^M$,
where $M$ is the number of feature inputs. For other than very small problems, $R^M$ is computationally
impractical. Ruck proposes a sampling method which is computationally practical. For
every training vector, each feature input is sampled over its range while the other feature inputs
are fixed as determined by the actual training vector being evaluated. For $P$ training vectors, the
number of derivative evaluations is $KPRM$, where $K$ is the number of output classes represented
by the net, $M$ is the number of features, and $R$ is the number of samples for each feature input of
each training vector. For the saliency computation of each feature, the set of “pseudo” data points
remains the same. Following Reinhart’s notation [53:21-22], define $\mathbf{d}_m$ as the vector of $R$ uniformly
spaced pseudo points covering the range of the $m$th input feature. The $r$th component, $d_r$, of $\mathbf{d}_m$
can be defined as:

$$d_r = \min_m + (r - 1) \frac{\max_m - \min_m}{R - 1}, \qquad r = 1, 2, \ldots, R \tag{17}$$

The Ruck saliency metric for feature $i$, $\Lambda_i$, is defined as

$$\Lambda_i = \sum_{p=1}^{P} \sum_{m=1}^{M} \sum_{r=1}^{R} \sum_{k=1}^{K} \left| \frac{\partial z_k}{\partial x_i}(\mathbf{x}^p_{m(r)}, \mathbf{w}) \right| \tag{18}$$

where $P$ is the number of training vectors $\mathbf{x}$; $M$ is the number of features; $R$ is the number of
uniformly spaced points covering the range of each input feature found in the training set; $K$ is the
number of output classes; the vector $\mathbf{x}^p_{m(r)}$ is the $p$th exemplar $\mathbf{x}^p$ with its $m$th component replaced
by $d_r$, the $r$th component of $\mathbf{d}_m$; and $(\mathbf{x}^p_{m(r)}, \mathbf{w})$ indicates that the derivative is evaluated with the
feature vector $\mathbf{x}^p_{m(r)}$ and the final estimates of the trained network weight parameters $\mathbf{w}$. Also, the
absolute values of the derivatives are used; therefore, positive and negative derivative changes do
not cancel each other. Empirical results indicate that $\Lambda_i$ provides similar rankings to [59:46].


Equation 18 has been modified from Ruck’s original presentation to reflect that for each
vector there are $PRM$ function evaluations of the network’s sensitivity $\left| \frac{\partial z_k}{\partial x_i}(\mathbf{x}, \mathbf{w}) \right|$ as Ruck
intended, rather than $PR$ evaluations as suggested by Ruck’s notation [59, 62].
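The pseudo-sampling procedure behind Equation 18 can be sketched as follows. This is a minimal illustration rather than Ruck’s implementation: it assumes a single hidden layer sigmoid network with explicit bias weights, and the weight layout (`w1[j][i]`, `w2[k][j]`), the data, and all numeric values are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def output_derivatives(x, w1, b1, w2, b2):
    """Return dz[k][i] = d z_k / d x_i for a one-hidden-layer sigmoid net.

    w1[j][i]: weight from input i to middle node j; w2[k][j]: weight from
    middle node j to output k; b1, b2: bias weights (assumed layout).
    """
    hidden = [sigmoid(b + sum(w * xi for w, xi in zip(row, x)))
              for row, b in zip(w1, b1)]
    out = [sigmoid(b + sum(w * h for w, h in zip(row, hidden)))
           for row, b in zip(w2, b2)]
    return [[z * (1 - z) * sum(w2[k][j] * hidden[j] * (1 - hidden[j]) * w1[j][i]
                               for j in range(len(hidden)))
             for i in range(len(x))]
            for k, z in enumerate(out)]


def pseudo_points(values, R):
    """R uniformly spaced points covering the observed range of one feature."""
    lo, hi = min(values), max(values)
    return [lo + r * (hi - lo) / (R - 1) for r in range(R)]


def ruck_saliency(data, R, w1, b1, w2, b2):
    M = len(data[0])
    grids = [pseudo_points([x[m] for x in data], R) for m in range(M)]
    saliency = [0.0] * M
    for x in data:                  # P training vectors
        for m in range(M):          # feature being swept over its range
            for d in grids[m]:      # R pseudo points
                probe = list(x)
                probe[m] = d
                dz = output_derivatives(probe, w1, b1, w2, b2)
                for i in range(M):  # accumulate |dz_k/dx_i| over the K outputs
                    saliency[i] += sum(abs(row[i]) for row in dz)
    return saliency


# Hypothetical network in which feature 1 has zero input weights.
w1 = [[2.0, 0.0], [-1.5, 0.0]]
b1 = [0.0, 0.5]
w2 = [[1.0, -1.0], [-1.0, 1.0]]
b2 = [0.0, 0.0]
data = [[0.0, 0.1], [1.0, 0.9], [0.5, 0.4]]
saliency = ruck_saliency(data, 3, w1, b1, w2, b2)
```

Each feature’s total collects $K$ absolute derivatives at each of the $PRM$ probe vectors, which matches the $KPRM$ count given above. A feature whose input weights are all zero receives a saliency of exactly zero.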

Priddy illustrates a relationship between the class specific Bayesian probability of error and
the derivative-based feature saliency metric, $\Lambda_i$ [51]. The relationship relies on the assumptions
necessary for feedforward neural networks to approximate a Bayes optimal discriminant function
[30, 54, 61, 68, 75] [41:50]. When these assumptions are met, the trained feedforward neural network
output $z_k$ can be interpreted in the limit as $P(C_k|\mathbf{x})$, which is the posterior probability of class $k$
for $\mathbf{x}$. Since $\sum_{k=1}^{K} P(C_k|\mathbf{x}) = 1$, the neural network approximation to the class specific probability
of error for class $k$ is defined as

$$P_e(k, \mathbf{x}, \mathbf{w}) = 1 - z_k(\mathbf{x}, \mathbf{w}) = \sum_{l \neq k} z_l(\mathbf{x}, \mathbf{w})$$

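Using Ruck’s method of feature space sampling, Priddy suggests a Bayesian-based saliency metric,
$\Omega_i$. It is defined using the modified notation shown in Equation 18 as [51]:

$$\Omega_i = \sum_{p=1}^{P} \sum_{m=1}^{M} \sum_{r=1}^{R} \sum_{k=1}^{K} \left| \frac{\partial P_e}{\partial x_i}(k, \mathbf{x}^p_{m(r)}, \mathbf{w}) \right| \tag{19}$$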

where $P$ is the number of training vectors $\mathbf{x}$; $M$ is the number of features; $R$ is the number of
uniformly spaced points covering the range of each input feature found in the training set; $K$ is the
number of output classes; the vector $\mathbf{x}^p_{m(r)}$ is the $p$th exemplar $\mathbf{x}^p$ with its $m$th component replaced
by $d_r$, the $r$th component of $\mathbf{d}_m$ defined in Equation 17; and $(\mathbf{x}^p_{m(r)}, \mathbf{w})$ indicates that the derivative
is evaluated with the feature vector $\mathbf{x}^p_{m(r)}$ and the final estimates of the trained network weight
parameters $\mathbf{w}$.

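In neural network terms, $\Omega_i$ is defined as

$$\Omega_i = \sum_{p=1}^{P} \sum_{m=1}^{M} \sum_{r=1}^{R} \sum_{k=1}^{K} \left| \sum_{l \neq k} \frac{\partial z_l}{\partial x_i}(\mathbf{x}^p_{m(r)}, \mathbf{w}) \right| \tag{20}$$

Using the triangle inequality, Priddy shows that $\Omega_i$ is bounded above by a simplified saliency metric
$\bar{\Omega}_i$, where [51]:

$$\bar{\Omega}_i = \sum_{p=1}^{P} \sum_{m=1}^{M} \sum_{r=1}^{R} \sum_{k=1}^{K} \sum_{l \neq k} \left| \frac{\partial z_l}{\partial x_i}(\mathbf{x}^p_{m(r)}, \mathbf{w}) \right| \tag{21}$$

Priddy shows $\bar{\Omega}_i$ is a scalar multiple of $\Lambda_i$, i.e.

$$\bar{\Omega}_i = (K - 1) \Lambda_i,$$

where $K$ is the number of output classes [51].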

Since Priddy’s metric is developed from Ruck’s metric, Equations 19, 20, and 21 have been
modified from Priddy’s original presentation. This reflects $PRM$ or $(K - 1)PRM$ sensitivity
function evaluations (i.e. for each vector, similar to Equation 18), rather than $PR$ or
$(K - 1)PR$ function evaluations as denoted in Priddy’s original notation (see discussion on page 51)
[51].
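The triangle inequality bound and the scalar multiple relationship can be checked numerically at a single evaluation point. The sketch below assumes that the class specific error derivative is the derivative of the sum of the other outputs; the function name and the derivative values are made up for illustration.

```python
def omega_terms(dz):
    """dz[k] = d z_k / d x_i at one evaluation point (hypothetical values).

    Returns the Bayesian-based term, its triangle-inequality bound, and the
    derivative-based (Ruck-style) term for that single point.
    """
    K = len(dz)
    omega = sum(abs(sum(dz[l] for l in range(K) if l != k)) for k in range(K))
    bound = sum(abs(dz[l]) for k in range(K) for l in range(K) if l != k)
    ruck = sum(abs(d) for d in dz)
    return omega, bound, ruck


dz = [0.30, -0.25, 0.05]           # three output classes
omega, bound, ruck = omega_terms(dz)
```

Each output’s absolute derivative appears in the bound once for every other class, which is why the bound equals $(K - 1)$ times the derivative-based term at every point and therefore also after summing over all points.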

Tarr suggests using a weight-based metric, $\tau_i$, for measuring feature $i$’s saliency [72]. This
metric depends only on the training exemplars used in training the neural network. The weight
saliency is defined as

$$\tau_i = \sum_{j=1}^{J} (w^1_{i,j})^2 \tag{22}$$

where $w^1_{i,j}$ denotes the estimated weight parameter connecting input feature $i$ to hidden node $j$
[72:45]. The idea is that the weights emanating from important features would grow the most; the
weights emanating from less important features would grow less; and the weights emanating from
unimportant features would fluctuate up and down about zero. The effectiveness of $\tau_i$ depends on
two things [72:45].

1.  The  weight  parameters  must  be  from  a  trained  neural  network  of  the  appropriate  complexity. 

2. The vectors of input features must be normalized to about the same range.

Computationally, this metric is much simpler than the other available feature saliency metrics.
Steppe reports that the Minkowski-1 (taxi-cab) and Minkowski-$\infty$ (infinity) norms of the weights provide
feature rankings similar to Tarr’s weight saliency, which is defined using the squared Minkowski-2
norm [70]. Tarr reports that $\tau_i$ provides rankings similar to $\Lambda_i$ [72:49].
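A weight-based saliency of this kind is simple to compute from the first layer weights. The sketch below is illustrative only: the weight matrix is made up, and `w1[i][j]` is assumed to hold the weight from input feature $i$ to hidden node $j$ of an already trained network.

```python
def tarr_saliency(w1):
    """Sum of squared weights leaving each input feature (squared 2-norm)."""
    return [sum(w * w for w in row) for row in w1]


def norm_saliencies(w1):
    """Taxi-cab (1-norm) and infinity-norm variants of the same idea."""
    l1 = [sum(abs(w) for w in row) for row in w1]
    linf = [max(abs(w) for w in row) for row in w1]
    return l1, linf


def ranking(scores):
    """Feature indices ordered from most to least salient."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Hypothetical weights for three features and three hidden nodes.
w1 = [[1.5, -2.0, 0.5],
      [0.1, -0.2, 0.05],
      [0.8, 0.9, -0.7]]
tau = tarr_saliency(w1)
l1, linf = norm_saliencies(w1)
```

For these made-up weights all three norms produce the same ranking, which is consistent with the reported similarity of rankings but does not by itself establish it.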

2.5.2 Sensitivity Analysis Procedures. Sensitivity analysis procedures for feedforward neural
network feature inputs are related to feature selection. This section reviews sensitivity analysis
procedures for neural network feature inputs using exact first and second order partial derivatives
[21, 22, 78], and estimated first order partial derivatives [33].

Werbos summarizes a collection of algorithms involving differentiation and cost minimization
[78]. These algorithms are all variations of neural networks using a form of error backpropagation for
cost minimization. The first partial derivatives are commonly used in sensitivity analysis. Further
information is also provided by the second partial derivatives, $\partial^2 z_k / \partial x_i \partial x_j$. The second partials can
be used for understanding the effect of feature interactions on model dynamics [78:766]. Werbos
recommends calculating exact first and second order derivatives of the neural network at each data
point. According to Werbos, computing exact partial derivatives is more accurate and easier
than estimating the partial derivatives using input variable perturbation [78:764,766].

Hashem presents mathematical expressions of first and second order neural network feature
sensitivities [22]. The expressions are generalized for a neural network with more than one hidden
layer which has sigmoidal activation functions on the hidden and output layers. Hashem computes
feature sensitivities using exact first and second partial derivatives of $z_k$ with respect to $x_i$, i.e.
$\partial z_k / \partial x_i$ and $\partial^2 z_k / \partial x_i \partial x_j$. For approximating a two dimensional function $g(x) = \sin(4x)$ with $x \in [-1, 1]$,
Hashem demonstrates how the exact partial derivatives of the neural network correspond closely
to the exact partial derivatives of the true function. Hashem does not present methodologies for
analyzing or summarizing these sensitivities for studying features.

Guo and Urig studied a neural network model of nuclear power plant thermal performance
data to identify the variables which strongly affect the heat rate [21]. In their study, they calculated
a feature’s sensitivity in a global or average sense. For each feature, the absolute values of the neural
network partial derivatives of output $k$ with respect to feature $i$, $\left| \partial z_k / \partial x_i \right|$, are averaged over all known
exemplars. This metric differs from Ruck’s metric in three ways. First, only the training data are
used. Second, it represents an average versus a summation over the feature space. Third, it only
examines sensitivities with respect to one output at a time rather than summing over all outputs.
Guo and Urig use their metric to rank order features with respect to their sensitivities for a specific
output. They do not suggest improving network performance by selecting a reduced set of the most
sensitive variables according to their sensitivity metric. However, they do suggest doing another
level of sensitivity analysis to determine the input variables which strongly affect the most sensitive
inputs [21:457]. To determine a second level of sensitivity, a neural network is trained to predict the
most sensitive feature variable using the remaining feature variables as network inputs. Then the
sensitivities are determined as before using Guo and Urig’s sensitivity metric for the neural network.
Guo and Urig propose that the information about sensitive input variables be used by plant personnel
to determine which efforts will be most effective in improving nuclear power plant efficiency [21].
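An averaged sensitivity of this kind can be sketched as follows. The `derivative` callable stands in for the exact partial derivative of a trained network and is a hypothetical single-output sigmoid model with made-up weights; only the averaging step reflects the description above.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


W = [3.0, 0.1]  # hypothetical weights of a one-output sigmoid model


def derivative(x, k, i):
    """d z_k / d x_i for the stand-in model (k is unused: single output)."""
    z = sigmoid(sum(w * xi for w, xi in zip(W, x)))
    return z * (1.0 - z) * W[i]


def average_sensitivity(data, derivative, k, i):
    """Mean of |d z_k / d x_i| over the known exemplars only."""
    return sum(abs(derivative(x, k, i)) for x in data) / len(data)


data = [[-1.0, 0.5], [0.0, -0.5], [1.0, 1.0]]
avg = [average_sensitivity(data, derivative, 0, i) for i in range(2)]
```

Unlike the pseudo-sampled summation, this requires only one derivative evaluation per exemplar for each feature and output.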

Klimasauskas investigates the impact of small changes in feature inputs on the neural network
output activations using estimated first order partial derivatives [33]. When measuring the
sensitivity of feature $i$ on neural network outputs, Klimasauskas “samples” two additional unknown
exemplars for an input exemplar which are used for estimating partial derivatives of the neural network
model. Each additional exemplar is the same as the original except that feature $i$ is perturbed a small
amount above or below its original value. For each known exemplar, the partial derivative of $z_k$ with
respect to $x_i$ is estimated by taking the difference of the trained neural network’s output at the
two additional exemplar samples divided by the total change in feature $i$ for the two additional
exemplars. Klimasauskas estimates the partial derivatives of the neural network output rather than
using exact partial derivatives of the trained neural network output as Werbos does [78]. Using
two-dimensional plots, Klimasauskas studies the estimated derivatives over the entire feature space
for sensitivity analysis. Two-dimensional plots display the numerical value of feature $i$’s partial
derivatives for each input exemplar with the x and y axes being feature $i$’s and $j$’s values [33:22].
For a simple problem, the derivatives are significantly different from zero only at the classification
borders where the neural network output transitions from one classification state to another [33:22].
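The perturbation estimate described above amounts to a central difference. A minimal sketch follows, in which `net` is a hypothetical stand-in for a trained network output and the step size is an arbitrary choice, not a value taken from [33].

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def net(x):
    """Stand-in for a trained network output (hypothetical weights)."""
    return sigmoid(2.0 * x[0] - 1.0 * x[1])


def estimated_derivative(model, x, i, step=1e-3):
    """Difference of the outputs at two perturbed exemplars, divided by
    the total change in feature i between them."""
    above = list(x)
    below = list(x)
    above[i] += step
    below[i] -= step
    return (model(above) - model(below)) / (above[i] - below[i])


x = [0.2, 0.1]
estimate = estimated_derivative(net, x, 0)
z = net(x)
exact = z * (1.0 - z) * 2.0  # analytic derivative of the stand-in model
```

For a smooth model and a small step the estimate agrees closely with the exact derivative, at the cost of two extra network evaluations per exemplar and feature.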
2.5.3 Selection Methodologies. Exhaustive enumeration of the feature subsets can be
accomplished using either a prediction-type error or a $P_e$-type criterion depending on the neural
network application. However, this method becomes impractical in a situation with more than a
few variables. It is impractical for two reasons: the large number of enumerated subsets, and the
computational requirements of training a neural network.


Practical  feature  selection  methodologies  available  for  neural  networks  can  be  divided  into 
three  broad  categories.  The  first  category  selects  the  k  best  features  using  one  of  the  feature  saliency 
metrics  described  earlier.  The  second  category  involves  screening  a  feature  set  for  “noise”  features. 
The  third  category  involves  hypothesis  testing  for  the  presence  of  irrelevant  features. 

The  first  category  of  selection  methodologies,  selecting  the  k  best  features,  includes  several 
of  the  feature  metrics  described  in  Section  2.5.1.  For  these  metrics,  the  k  best  features  are  selected 
according  to  the  metric  rankings,  although  no  guidance  or  procedure  accompanies  the  metrics  for 
determining  k  [39,  59,  51,  72]. 


In the second category of selection methodologies, Belue and Bauer offer a feature screening
technique for identifying a “noise” feature [5]. Their procedure requires adding a noise feature into
the original set of features. The neural network is trained with the augmented set of features, and
the saliency of all features is computed using either the $\Lambda_i$ or $\tau_i$ metric. The training and saliency
computation is repeated many times with randomized initial weights in order to characterize the
saliency distribution of the noise feature. Belue and Bauer recommend selecting only the $k$ features
whose mean saliency falls outside a one-sided confidence interval for the mean saliency of the noise
[5].
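The screening step can be sketched as follows. This is a simplified illustration, not Belue and Bauer’s exact procedure: the saliency values are made up, the repeated training runs are assumed to have been done already, and the upper confidence bound uses a normal quantile as a stand-in for whatever interval construction the original procedure specifies.

```python
import math
from statistics import NormalDist, mean, stdev


def screen_features(saliency_runs, noise_runs, alpha=0.05):
    """Keep features whose mean saliency exceeds the upper one-sided
    confidence bound on the noise feature's mean saliency.

    saliency_runs: {feature name: [saliency from each training run]}
    noise_runs:    [saliency of the injected noise feature per run]
    """
    n = len(noise_runs)
    quantile = NormalDist().inv_cdf(1.0 - alpha)  # normal approximation (assumption)
    upper = mean(noise_runs) + quantile * stdev(noise_runs) / math.sqrt(n)
    kept = [name for name, runs in saliency_runs.items() if mean(runs) > upper]
    return kept, upper


# Made-up saliencies from five training runs with random initial weights.
noise = [0.10, 0.12, 0.09, 0.11, 0.13]
candidates = {
    "x1": [0.90, 1.10, 1.00, 0.95, 1.05],
    "x2": [0.11, 0.10, 0.12, 0.09, 0.12],
}
kept, upper = screen_features(candidates, noise)
```

Here `x2` is indistinguishable from the noise feature and is screened out, while `x1` is retained.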


White’s irrelevant input hypothesis test (for neural networks trained with backpropagation)
falls into the third category of feature selection methodologies [82]. The hypothesis testing methodology
requires the computation of a chi-squared test statistic to identify when a vector of weights
connected to a feature input is irrelevant (i.e. the hypothesis that the weights equal zero cannot be
rejected). White’s hypothesis test is expressed as

$$H_0 : \mathbf{S}\mathbf{w}^* = \mathbf{0},$$

where $\mathbf{S}$ is a $g \times s$ selection matrix picking out the $g$ elements of the $s \times 1$ vector of neural network
optimal weights $\mathbf{w}^*$ which are hypothesized to be zero under $H_0$. When $H_0$ is true, the $g$ elements
of the estimated weight vector $\mathbf{w}$, selected by $\mathbf{S}\mathbf{w}$, are typically weights which are small in absolute
magnitude.

White’s irrelevant input hypothesis test requires that the limiting distribution of $\mathbf{w}$ as it
converges to $\mathbf{w}^*$ is a multivariate normal distribution [79]. The limiting distribution of $\mathbf{w}$ will
be multivariate normal if the redundant inputs and/or irrelevant hidden
units are removed [79:441]. When $\mathbf{w}$ has a multivariate normal limiting distribution, then

$$\sqrt{P}(\mathbf{w} - \mathbf{w}^*) \sim N_s(\mathbf{0}, \mathbf{C}^*),$$

where $P$ is the number of data exemplars [79]. White shows that the multivariate normal distribution
of $\sqrt{P}(\mathbf{w} - \mathbf{w}^*)$ implies the following is true [79]:

$$\sqrt{P}\,\mathbf{S}(\mathbf{w} - \mathbf{w}^*) \sim N_g(\mathbf{0}, \mathbf{S}\mathbf{C}^*\mathbf{S}')$$

When $H_0$ is true, then $\mathbf{S}\mathbf{w}^* = \mathbf{0}$ [79]. This implies

$$\sqrt{P}\,\mathbf{S}\mathbf{w} \sim N_g(\mathbf{0}, \mathbf{S}\mathbf{C}^*\mathbf{S}'),$$

and therefore

$$P\,\mathbf{w}'\mathbf{S}'(\mathbf{S}\mathbf{C}^*\mathbf{S}')^{-1}\mathbf{S}\mathbf{w} \sim \chi^2_g \tag{23}$$

An analytical expression for $\mathbf{C}^*$ is not available, but an estimator $\hat{\mathbf{C}}$ exists which is weakly consistent,
where

$$\hat{\mathbf{C}} = \mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1} \tag{24}$$

The matrices $\mathbf{A}$ and $\mathbf{B}$ are defined

$$\mathbf{A} = P^{-1} \sum_{p=1}^{P} \nabla^2 \mathcal{E}(\mathbf{x}^p, \mathbf{w})$$

$$\mathbf{B} = P^{-1} \sum_{p=1}^{P} \nabla \mathcal{E}(\mathbf{x}^p, \mathbf{w}) \nabla \mathcal{E}(\mathbf{x}^p, \mathbf{w})'$$

where $\mathcal{E}$ is the neural network error used for training, the operators $\nabla$ and $\nabla^2$ denote the $s \times 1$
gradient and the $s \times s$ Hessian operators of $\mathcal{E}$ defined with respect to $\mathbf{w}$, and $\mathbf{x}^p$ is the $p$th input
exemplar. Replacing $\mathbf{C}^*$ with $\hat{\mathbf{C}}$ does not affect the limiting distribution. However, White
warns that sometimes a much larger sample size $P$ is required to obtain a good approximation
of $\mathbf{C}^*$ [79:442-443]. The test statistic defined in Equation 23 is used to test $H_0$ at a desired
confidence level of $1 - \alpha$. Whenever the test statistic exceeds the $100(1 - \alpha)$th percentile of $\chi^2_g$, the irrelevant
input hypothesis is rejected. The probability of rejecting $H_0$ when $H_0$ is true is equal to $\alpha$.
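Given an estimate of the covariance matrix, the test statistic of Equation 23 reduces to a small quadratic form. The sketch below is illustrative only: the weights and the covariance estimate are made up, the selection matrix is represented by a list of selected indices, and the estimation of the covariance matrix from the gradient and Hessian is not shown. The critical values used in the comments (3.841 for one degree of freedom and 5.991 for two, at the 0.05 level) are standard chi-squared table values.

```python
def solve(matrix, rhs):
    """Solve matrix * y = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * y[c] for c in range(r + 1, n))
        y[r] = (aug[r][n] - tail) / aug[r][r]
    return y


def white_statistic(w, C, selected, P):
    """P * (Sw)' (S C S')^{-1} (Sw), with S picking the `selected` weights."""
    v = [w[i] for i in selected]                            # S w
    sub = [[C[i][j] for j in selected] for i in selected]   # S C S'
    y = solve(sub, v)                                       # (S C S')^{-1} S w
    return P * sum(a * b for a, b in zip(v, y))


# Hypothetical estimated weights and covariance estimate for s = 3 weights.
w = [1.2, 0.02, -0.01]
C = [[2.0, 0.0, 0.0],
     [0.0, 1.0, 0.2],
     [0.0, 0.2, 1.0]]
small = white_statistic(w, C, [1, 2], P=100)   # compare with 5.991 (g = 2)
large = white_statistic(w, C, [0], P=100)      # compare with 3.841 (g = 1)
```

With these numbers the two small weights give a statistic well below the critical value, so the irrelevant input hypothesis is not rejected for them, while the single large weight gives a statistic far above it.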

2.6  Summary 

Feature evaluation metrics and feature selection techniques developed for linear regression,
nonlinear regression, discriminant analysis, and feedforward neural networks are surveyed in the
body of this chapter. In the remainder of the dissertation, research results are presented which in
some cases were derived from the body of knowledge surveyed in this chapter.

In Chapter III, improved neural network feature saliency metrics are introduced along with a
catalogue of all the available saliency metric definitions and relationships. A good number of these
metrics are surveyed in Section 2.5. The saliency screening technique and the irrelevant input
hypothesis test reviewed in Section 2.5 are used for the results presented in Chapter IV. To determine
a practical model selection criterion for neural network selection, the linear and nonlinear regression
model selection criteria from Sections 2.2 and 2.3 are studied. The sequential selection algorithms
from linear regression and discriminant analysis are the basis for the backwards sequential algorithm
used in the procedures developed in Chapters V and VI.

III. Feedforward Neural Network Feature Saliency Metrics


3.1  Introduction 

Feature saliency metrics are used for evaluating and ranking individual features within a
neural network. In this research, the term metric is used to refer to a measure of feature
importance. The results shown in this chapter are important for three reasons. First, new and
improved feature saliency metrics are presented. Second, saliency metric sensitivities to sampling,
training, and redundant middle nodes are documented. Third, a catalogue of feature saliency
metric definitions and interrelationships is presented which consolidates the set of available neural
network feature saliency metrics. An overview of this chapter follows.

In Section 2 of this chapter, a framework for understanding derivative-based saliency is
discussed. Several variations of derivative-based saliency are investigated, including a known data
metric which requires fewer derivative evaluations than an established metric. The saliency metrics
are evaluated for their sensitivities to sampling, training, and redundant middle nodes. In Section
3, a mathematical relationship is derived between derivative-based saliency and the weight-based
saliency. $P_e$-based feature saliency metrics and related research results are discussed in Section
4. These results include the illustration of a precise relationship between an established $P_e$-based
metric and an established derivative-based metric, the introduction of a new $P_e$-based feature
saliency metric, and derivation of relationships between the new $P_e$-based metric and the improved
derivative-based metric. In Section 5, a catalogue of definitions and theoretical relationships among
the set of available feature saliency metrics is presented. Also in Section 5, feature saliency results
are documented on a ‘real world’ problem for the set of available feature metrics. The results
presented in this chapter are summarized in Section 6.

The research and theoretical results presented in this chapter reflect the exclusive use of the
sigmoidal activation functions on the middle and output nodes of a feedforward neural network
(presented in Chapter I). The fundamental network output and network derivative definitions will
change for other types of activation functions. However, the underlying concepts and
relationships shown throughout this chapter will still hold. The neural network notational conventions,
network structure, and backpropagation algorithm introduced in Section 3 of Chapter I, as well
as the feature saliency notation reviewed in Section 4 of Chapter II, are used as necessary in this
chapter.

3.2  Derivative-based  Feature  Saliency  Metrics 

In this section, neural network partial derivatives and related notations are reviewed. Then,
an integrated saliency metric is introduced as a ‘truth model’ for derivative-based saliency. Several
approximations for integrated saliency are discussed since the integrated saliency metric is
intractable for most problems. The first approximation is an established saliency metric proposed
by Ruck which involves evaluating the saliency with what could be called ‘pseudo-samples’ from
the feature space [60]. The second approximation is similar to Ruck’s, but the saliency is
evaluated with random samples from the feature space. The third approximation is also similar to
Ruck’s, but the saliency is evaluated with only the known data from the feature space. All of these
derivative-based saliency metrics are analyzed for their sensitivity to sampling, training length, and
redundant middle nodes. Finally, a summary of the derivative-based saliency results is presented.

3.2.1 Background. The importance of an input feature is a function of the network’s
sensitivity to changes in the input feature [21, 22, 33, 59, 78]. Evaluating the network’s sensitivity
to the $i$th feature input is analogous to evaluating partial derivatives of the network output with
respect to the $i$th feature input.

A slight digression is necessary to review some of the neural network notation related to
evaluating partial derivatives. Let the input features $x_i$ be indexed from $i = 0, \ldots, M$, the middle
node activations $x^1_j$ be indexed from $j = 0, \ldots, J$, and the output node activations $z_k$ be indexed from
$k = 1, \ldots, K$. Let $f(a)$ represent the sigmoidal nonlinear activation function defined as:

$$f(a) = \frac{1}{1 + e^{-a}}$$

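When sigmoidal activation units are used on the middle and output nodes, the feedforward
neural network output and hidden node activations and associated specialized terms are defined as
follows:

$$z_k = f\left( \sum_{j=0}^{J} w^2_{j,k} \, x^1_j \right)$$

$$f'\left( \sum_{j=0}^{J} w^2_{j,k} \, x^1_j \right) = z_k (1 - z_k)$$

$$x^1_j = f\left( \sum_{i=0}^{M} w^1_{i,j} \, x_i \right)$$

where $x_0$ and $x^1_0$ are bias terms which are equal to one, $w^2_{j,k}$ is an estimate of the weight parameter
connecting the $j$th middle node with the $k$th output, and $w^1_{i,j}$ estimates the weight parameter connecting
the $i$th feature input with the $j$th middle node. Now, applying partial differentiation to $z_k$ with
respect to $x_i$ gives:

$$\frac{\partial z_k}{\partial x_i} = z_k (1 - z_k) \sum_{j=1}^{J} w^2_{j,k} \, x^1_j (1 - x^1_j) \, w^1_{i,j}$$

The definitions for $z_k$, $x^1_j$, and $\partial z_k / \partial x_i$ are significantly different when sigmoidal activation functions
are not used.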
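The chain rule expression for the output derivative of a single hidden layer sigmoid network can be checked against a finite difference estimate. The sketch below is an illustration under assumed conventions: biases are stored as separate vectors rather than as index zero, the weight layout is `w1[j][i]` and `w2[k][j]`, and all numeric values are hypothetical.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, w1, b1, w2, b2):
    """Middle node and output activations of a one-hidden-layer sigmoid net."""
    hidden = [sigmoid(b + sum(w * xi for w, xi in zip(row, x)))
              for row, b in zip(w1, b1)]
    out = [sigmoid(b + sum(w * h for w, h in zip(row, hidden)))
           for row, b in zip(w2, b2)]
    return hidden, out


def dz_dx(x, w1, b1, w2, b2, k, i):
    """Analytic d z_k / d x_i from the chain rule for sigmoid units."""
    hidden, out = forward(x, w1, b1, w2, b2)
    inner = sum(w2[k][j] * hidden[j] * (1.0 - hidden[j]) * w1[j][i]
                for j in range(len(hidden)))
    return out[k] * (1.0 - out[k]) * inner


def dz_dx_numeric(x, w1, b1, w2, b2, k, i, h=1e-6):
    """Central difference estimate of the same derivative."""
    up = list(x)
    down = list(x)
    up[i] += h
    down[i] -= h
    z_up = forward(up, w1, b1, w2, b2)[1][k]
    z_down = forward(down, w1, b1, w2, b2)[1][k]
    return (z_up - z_down) / (2 * h)


# Hypothetical two-input, two-middle-node, two-output network.
w1 = [[1.0, -2.0], [0.5, 0.3]]
b1 = [0.1, -0.2]
w2 = [[1.5, -1.0], [-1.5, 1.0]]
b2 = [0.0, 0.0]
x = [0.4, -0.7]
gaps = [abs(dz_dx(x, w1, b1, w2, b2, k, i) - dz_dx_numeric(x, w1, b1, w2, b2, k, i))
        for k in range(2) for i in range(2)]
```

The bias terms do not appear in the derivative sum because they do not depend on the feature inputs.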

In Figure 4, there are three examples of classification problems displayed. These three
examples encompass the range from output classes not overlapping (Example 1) to output classes
significantly overlapping (Example 3). For each example shown in Figure 4, a neural network was
trained on 200 training vectors using two output nodes and one middle node with a step size of 0.3
and a momentum of 0.7. Training was discontinued when the training-test set error was minimized.
This occurred at two, five, and ten epochs for the three examples.

In the first row of Figure 4, the true likelihood function of the data in each class is shown. For
all three examples, the true underlying distribution function is the normal distribution function,
denoted $h(x)$, which is

$$h(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)$$

where $\mu$ and $\sigma$ are the expected value and standard deviation of $x$.


In the second row of Figure 4, the a posteriori distribution function of $x$ is shown. The a
posteriori distribution function for the $k$th class, denoted $P(C_k|x)$, is defined from Bayes rule as

$$P(C_k|x) = \frac{P(C_k) \, h_k(x)}{\sum_{l=1}^{K} P(C_l) \, h_l(x)}$$

where $P(C_k)$ denotes the prior probability of class $k$ and $h_k(x)$ denotes the likelihood function for
class $k$.
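For univariate normal class likelihoods, the a posteriori probabilities follow directly from Bayes rule. The following minimal sketch uses made-up priors, means, and standard deviations; they are not the parameters of the examples in Figure 4.

```python
import math


def normal_density(x, mu, sigma):
    """Normal likelihood h(x) with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))


def posteriors(x, priors, params):
    """P(C_k | x) for every class, given priors and (mu, sigma) per class."""
    joint = [p * normal_density(x, mu, sigma)
             for p, (mu, sigma) in zip(priors, params)]
    total = sum(joint)
    return [j / total for j in joint]


# Two equally likely classes with hypothetical means -1 and +1.
priors = [0.5, 0.5]
params = [(-1.0, 1.0), (1.0, 1.0)]
at_border = posteriors(0.0, priors, params)
at_right = posteriors(2.0, priors, params)
```

At the midpoint between the two class means the posteriors are equal, which is the classification border; away from it, the posterior of the nearer class dominates.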


In the third row of Figure 4, the neural network’s output function for class one is shown. In
this network, the outputs from each class can be interpreted as an approximation to the a posteriori
distribution for $x$ (see the discussion in Section 3 of Chapter I). Notice in Example 1, the neural network
output for class one is a poor approximation to the a posteriori distribution when the classes are
not overlapping. However, when the tails of the two classes do overlap in Examples 2 and 3, the
neural network outputs are better approximations to the a posteriori probabilities.

In the fourth row of Figure 4, the absolute value of the neural network’s feature ‘sensitivity
function’ is shown. For these univariate two-class examples, the ‘saliency function’ is

$$\sum_{k=1}^{K} \left| \frac{\partial z_k(x, \mathbf{w})}{\partial x_i} \right|,$$

where $z_k(x, \mathbf{w})$ indicates that $z_k$ is evaluated with the univariate feature vector $x$ and the vector
of estimated weight parameters $\mathbf{w}$. Notice, the neural network’s maximum feature sensitivity
corresponds to the classification borders where the neural network’s output is equal to $\frac{1}{2}$. For the
second and third examples, the region where the network is most sensitive to the features is also the
region where the true a posteriori distribution is most sensitive. However, in Example 1 where
the likelihood distributions are not overlapping, the most sensitive regions of the feature saliency
and the a posteriori distribution do not correspond.


3.2.2 ‘Truth Model:’ Integrated Feature Saliency. A comprehensive measure of derivative-based
feature saliency is defined as the expected feature sensitivity integrated over the entire feature
space. For the univariate discrimination problems shown in Figure 4, this metric entails integrating
under the ‘saliency function’ curves shown in the fourth row. Although the results shown in this
section use notation and definitions specific to feedforward neural networks with sigmoid activation
functions, this framework can also be applied to neural networks defined with other types of
activation functions.


For the second and third examples shown in Figure 4, the region of integration is representative
of where the true data exists. However, for the first example, the region of integration includes a
portion of the curve where the likelihood of being in either class is zero. This portion of the curve
corresponds to large values of the ‘saliency function,’ yet poor approximations to the a posteriori
distribution.


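Define integrated feature saliency as the average value of the saliency function $f_i(\mathbf{x}, \mathbf{w}) = \sum_{k=1}^{K} \left| \frac{\partial z_k}{\partial x_i}(\mathbf{x}, \mathbf{w}) \right|$ over the feature space region [67:179]. Let the integrated feature saliency, denoted $\Lambda_i$, be given as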


A,  =  /  / 

In*' 


ti  (».*)=£“=,  I  lif  (*.♦)  I 


/id/idV' 


(26) 


where  represents  the  M-dimensional  feature  space  region  (i.e.  /,,  •  •  •  and  dV"  = 

represents  the  total  saliency  'volume’  which  is  given  as  [67:187] 


V,=  / 

Jn^  J/j(x,«r)=0 


dfi  dn“, 


(27)

The limits of integration on each feature correspond to the observed range of the data inputs. The absolute values of the derivatives are measured to ensure that the positive and negative derivatives do not cancel each other.

A number of numerical methods can be used to evaluate Equation 26. These methods give good approximations for smooth regions with no pockets of highly peaked values [50:]. Most numerical integration methods require a discrete number of function evaluations, which depends on the number of features $M$ and the number of evaluation points $R$ for each feature dimension. The number of function evaluations increases exponentially with $M$: on the order of $R^M$ function evaluations are needed. For example, a 10 point Gauss-Legendre integration requires $10^M$ function evaluations for $M$ features. The number of function evaluations is reasonable when $M$ is small, but consider a case where 10 point Gauss-Legendre integration is used for $M = 10$ features. In this case, $10^{10}$ or 10 billion function evaluations of $f_i$ are needed to compute the saliency of each feature.
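To make this growth concrete, the short Python sketch below counts the function evaluations needed when $R$ points are used along each of $M$ feature dimensions. It is illustrative only: the helper name `grid_evaluations` is an assumption introduced here, not part of the original work.

```python
def grid_evaluations(R: int, M: int) -> int:
    """Evaluations needed when R points are used along each of M feature dimensions."""
    return R ** M


# 10 points per dimension, as in the Gauss-Legendre example in the text
for M in (1, 2, 5, 10):
    print(f"M = {M:2d}: {grid_evaluations(10, M):,} function evaluations")
```

The last line printed corresponds to the $M = 10$ case discussed above.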

Clearly, a computationally tractable method is needed for evaluating feature saliency. In the next three sections, tractable methods for approximating $\Lambda_i$ are defined. These approximations are similar, but each one uses a different set of data for evaluating the function $f_i(\mathbf{x}, \mathbf{w}) = \sum_{k=1}^{K} \left| \frac{\partial z_k}{\partial x_i}(\mathbf{x}, \mathbf{w}) \right|$.

3.2.3 ‘Pseudo-Data’ Approximation. The first method for approximating $\Lambda_i$ was proposed (implicitly) by Ruck [59]. This metric is reviewed in Section 2.5 of Chapter II. Ruck’s metric involves what could be called ‘pseudo-sampling’ from the $M$-dimensional feature space $\mathcal{R}^M$.

A description of ‘pseudo-sampling’ follows [59]. For every training vector, each feature input is sampled uniformly over its observed range while the other feature inputs keep the values of the training vector being sampled. This corresponds to $PRM$ ‘pseudo-samples,’ where $P$ is the number of vectors in the training set, $R$ is the number of uniformly spaced sample points per feature dimension, and $M$ is the number of features as before. For the saliency computation of each feature, the set of ‘pseudo’ data points remains the same. This approximation is computationally tractable, because the number of function evaluations now increases linearly with $M$ rather than exponentially.


For the $i$th feature, Ruck’s metric (defined earlier in Equation 18) is described again for reference. Following Reinhart’s notation [53:21-22], let $\mathbf{d}_m$ be the vector of $R$ uniformly spaced pseudo points covering the range of the $m$th input feature. The $r$th component, $d_r$, of $\mathbf{d}_m$ can be defined as:

$$ d_r = \min x_m + (r-1)\,\frac{\max x_m - \min x_m}{R-1}, \qquad r = 1, 2, \ldots, R \tag{28} $$

The Ruck saliency metric for feature $i$, $\hat{\Lambda}_i$, is defined again for convenience as

$$ \hat{\Lambda}_i = \sum_{p=1}^{P} \sum_{m=1}^{M} \sum_{r=1}^{R} \sum_{k=1}^{K} \left| \frac{\partial z_k}{\partial x_i}(\mathbf{x}^p_{m(r)}, \mathbf{w}) \right|, \tag{29} $$

where $P$ is the number of training vectors $\mathbf{x}$; $M$ is the number of features; $R$ is the number of uniformly spaced points covering the range of each input feature found in the training set; $K$ is the number of output classes; the vector $\mathbf{x}^p_{m(r)}$ is the $p$th exemplar vector $\mathbf{x}^p$ with its $m$th component replaced by $d_r$, the $r$th component of $\mathbf{d}_m$; and $(\mathbf{x}^p_{m(r)}, \mathbf{w})$ indicates that the derivative is evaluated with the feature vector $\mathbf{x}^p_{m(r)}$ and the final estimates of the trained network weight parameters $\mathbf{w}$.

The approximation $\hat{\Lambda}_i$ represents the total network saliency for $PMR$ ‘pseudo-sampled’ data points. Let the average ‘pseudo-saliency’ be defined

$$ \Lambda_i^{pseudo} = (PMR)^{-1}\, \hat{\Lambda}_i \tag{30} $$

For making empirical comparisons between $\Lambda_i$ and the various approximations to $\Lambda_i$, the average ‘pseudo saliency’ $\Lambda_i^{pseudo}$ is most appropriate.
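The pseudo-sampling procedure can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the original implementation: the network is assumed to have one layer of sigmoid middle nodes and sigmoid outputs, the weights are random stand-ins rather than trained values, and all function and variable names (`saliency_function`, `pseudo_samples`, and so on) are hypothetical.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def saliency_function(X, W1, b1, W2, b2):
    """Return f_i(x, w) = sum_k |dz_k/dx_i| for every row of X, shape (n, M)."""
    H = sigmoid(X @ W1 + b1)                 # middle-node outputs, (n, H)
    Z = sigmoid(H @ W2 + b2)                 # network outputs, (n, K)
    # Chain rule: dz_k/dx_i = z_k(1-z_k) * sum_j W2[j,k] * h_j(1-h_j) * W1[i,j]
    J = np.einsum("nk,jk,nj,ij->nki", Z * (1 - Z), W2, H * (1 - H), W1)
    return np.abs(J).sum(axis=1)


def pseudo_samples(X, R):
    """Build the P*M*R pseudo-samples: each exemplar with one feature swept over its range."""
    P, M = X.shape
    lo, hi = X.min(axis=0), X.max(axis=0)
    out = np.tile(X[:, None, None, :], (1, M, R, 1))    # (P, M, R, M) copies of each exemplar
    for m in range(M):
        out[:, m, :, m] = np.linspace(lo[m], hi[m], R)  # R uniformly spaced points over the range
    return out.reshape(P * M * R, M)


rng = np.random.default_rng(0)
P, M, n_hidden, K, R = 50, 2, 3, 2, 10
X = rng.uniform(0.0, 1.0, size=(P, M))                  # stand-in training set
W1, b1 = rng.normal(size=(M, n_hidden)), rng.normal(size=n_hidden)
W2, b2 = rng.normal(size=(n_hidden, K)), rng.normal(size=K)

pseudo = pseudo_samples(X, R)
total_saliency = saliency_function(pseudo, W1, b1, W2, b2).sum(axis=0)  # summed over all pseudo-samples
average_saliency = total_saliency / (P * M * R)                         # average 'pseudo-saliency'
print(average_saliency)
```

The same set of pseudo-samples is reused for every feature, so the cost grows with $PMR$ network evaluations rather than with $R^M$.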

3.2.4 Random Data Approximation. A second method for approximating $\Lambda_i$ can be defined using random samples from the $M$-dimensional feature space $\mathcal{R}^M$. This approximation is similar to $\Lambda_i^{pseudo}$, but random samples are used instead of ‘pseudo-samples.’ For data normalised to a unit hypercube, the $n$th random sample is created by drawing a UNF(0, 1) random number for each feature.

The random data approximation, denoted $\Lambda_i^{random}$, is given as

$$ \Lambda_i^{random} = N^{-1} \sum_{n=1}^{N} \sum_{k=1}^{K} \left| \frac{\partial z_k}{\partial x_i}(\mathbf{x}^n, \mathbf{w}) \right| $$

where $\mathbf{x}^n$ denotes the $n$th random sample drawn from $\mathcal{R}^M$, and $N$ is the total number of random samples. The random data approximation to $\Lambda_i$ represents the average network saliency over $N$ randomly sampled data points.


3.2.5 Known Data Approximation. A third method for approximating $\Lambda_i$ is defined by sampling only the known data. Guo and Uhrig suggested a similar metric for sensitivity analysis of nuclear power plant thermal data [21]. Guo and Uhrig’s metric is different because they consider the sensitivity for each of the $K$ outputs separately. For the known data saliency, the network’s feature sensitivity is evaluated in a manner proportional to the total likelihood function of the data. Note that regions of maximum total likelihood may not correspond to regions of maximum feature saliency, as in Example one in Figure 4 on page 63.

The known data saliency, denoted $\Lambda_i^{known}$, is given as

$$ \Lambda_i^{known} = P^{-1} \sum_{p=1}^{P} \sum_{k=1}^{K} \left| \frac{\partial z_k}{\partial x_i}(\mathbf{x}^p, \mathbf{w}) \right| \tag{31} $$

This metric requires a factor of $RM$ fewer computations than $\Lambda_i^{pseudo}$. The known data approximation to $\Lambda_i$ represents the average network saliency over the $P$ known data points.


The various approximations to $\Lambda_i$ are analyzed in the next section. The known data approximation provides results similar to the other metrics. It is probably the best choice for practical use since it evaluates regions of the feature space where the data is known, and it requires fewer calculations than the integrated or pseudo metrics.
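The random data and known data approximations differ only in the points at which the saliency function is averaged. The Python sketch below illustrates this with a hypothetical two-feature network that has a single middle node; the weights, the stand-in data sets, and the use of central differences for the derivatives are all assumptions made for the example, not details taken from the original work.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def network(X):
    """Hypothetical 'trained' network: 2 features, 1 middle node, 2 outputs."""
    W1 = np.array([[4.0], [0.5]])          # feature 1 is weighted more heavily than feature 2
    W2 = np.array([[3.0, -3.0]])
    hidden = sigmoid(X @ W1 - 2.0)
    return sigmoid(hidden @ W2 + np.array([-1.5, 1.5]))


def mean_saliency(net, X, eps=1e-5):
    """Average over the rows of X of f_i = sum_k |dz_k/dx_i| (central differences)."""
    n, M = X.shape
    f = np.zeros((n, M))
    for i in range(M):
        step = np.zeros(M)
        step[i] = eps
        dz = (net(X + step) - net(X - step)) / (2.0 * eps)   # (n, K) estimate of dz_k/dx_i
        f[:, i] = np.abs(dz).sum(axis=1)
    return f.mean(axis=0)


rng = np.random.default_rng(1)
X_train = rng.beta(2.0, 5.0, size=(400, 2))        # stand-in 'known' data inside the unit square
X_random = rng.uniform(0.0, 1.0, size=(1000, 2))   # N uniform draws, one UNF(0, 1) value per feature

known = mean_saliency(network, X_train)      # averaged over the P known points
random_ = mean_saliency(network, X_random)   # averaged over the N random points
print("known :", known, "ratio", known[0] / known[1])
print("random:", random_, "ratio", random_[0] / random_[1])
```

The two averages generally differ in magnitude because they sample different regions of the feature space, while the ratio between the two features is the quantity compared in the analysis that follows.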

3.2.6 Analysis. The derivative-based metrics introduced in the previous sections are analysed in this section. To do this, the three examples shown in Figure 4 are revisited. However, for this analysis, a N(0, 1) random variable, denoted $x_{noise}$, is added as a second feature to each class. Both features are normalised between 0 and 1. The relative importance of the ‘truly salient’ feature to noise is analyzed. The saliencies for these examples are evaluated for sensitivities to over-training and redundant middle nodes. The metrics $\Lambda_i^{pseudo}$ and $\Lambda_i^{random}$ are also analyzed for their sensitivity to sampling.

Since  these  examples  are  linearly  separable,  a  minimal  network  (i.e.  no  redundant  middle 
nodes)  consists  of  one  middle  node.  For  each  level  of  training,  the  minimal  network’s  results  are 
used  as  the  baseline  for  comparison.  Figure  5  is  a  collection  of  three  dimensional  plots  summarizing 
the  three  examples.  Figure  6  and  Table  5  are  used  to  summarize  the  sensitivities  of  the  derivative- 
based  saliency  metrics.  In  Table  6,  a  general  summary  of  these  sensitivities  is  presented. 

For all the examples, the networks used 400 training vectors, two output nodes, and a log-linear declining learning rate (see Chapter I for definition). The neural network plots for each example in Figure 5 are from a single realization of a trained neural network using one middle node. For all of the plots, the x-axis corresponds to the variable $x_1$, and the y-axis corresponds to the variable $x_{noise}$.

In the first and second rows of Figure 5, the z-axis represents the value of the individual likelihood functions and the trained neural network output functions for each class. In the third and fourth rows, the z-axis represents the value of the $i$th ‘saliency function’ $f_i(\mathbf{x}, \mathbf{w})$. The terminology ‘saliency function’ is appropriate, because each of the derivative-based saliency metrics is defined as a series of ‘saliency function’ measurements. In the last row of Figure 5, the z-axis represents a ratio of the ‘saliency function’ for $x_1$ divided by the ‘saliency function’ for $x_{noise}$.

The ‘saliency functions’ shown in the third and the fourth row vary greatly across the feature space. They are most peaked where the neural network’s output function has the greatest slope. All three derivative-based metrics obtain markedly different values due to different ‘saliency function’ measurements. For instance, in the first example, the known data saliency metric would be the smallest, because the known data is from a region where the ‘saliency function’ is not peaked.

The ‘saliency function’ ratio shown in the fifth row is a relatively flat or constant function. Where the ratio is not flat, it fluctuates due to the division of two very small numbers. Also, in the regions where the ratio is not constant, there is very little, if any, true data.

The ‘saliency function’ ratio at any point can be interpreted as the relative importance of the feature $x_1$ to the feature $x_{noise}$. The ‘saliency function’ ratios in Figure 5 indicate that the relative importance of one feature to another is nearly constant regardless of how or where the ‘saliency function’ is measured. Therefore, when measuring the relative importance of a feature, all of the metrics perform about the same. This can be seen in Figure 6 when comparing integrated and known-data saliencies, and in the last row of Table 5 when comparing the sample saliency function ratios.

The results shown in Figure 6 document the saliency metric sensitivities to training and redundant middle nodes. For completeness, the results documented in Figure 6 are summarized in Table 5. In Figure 6, the x-axis corresponds to the number of middle nodes used, and the y-axis corresponds to the value of the ‘saliency function’ ratio. Each line of the plots corresponds to a different amount of training. The smallest amount of training corresponds to the point where the network classification error is initially minimized. A point on the plots represents an experiment involving 30 neural network runs with the corresponding amount of training and number of middle nodes.

For each experiment, the ‘saliency function’ ratios for the integrated and the known feature saliency metrics are about the same. The ‘saliency function’ ratios corresponding to the metrics $\Lambda_i^{pseudo}$ and $\Lambda_i^{random}$ are not shown, since their results do not differ significantly from those shown for the integrated and known saliency metrics in Figure 6.

[Figure 6. Summary of Middle Node and Training Sensitivities. Panel title: Known-Data Derivative-Based Saliency Results; x-axis: Middle Nodes.]

Table 5. Derivative-Based Saliency Metrics for Three Examples

| Criterion | Example 1 (tails not overlapping) | Example 2 (tails slightly overlapping) | Example 3 (tails overlapping) |
|---|---|---|---|
| Relative saliency ratio for 1 middle node | small | large | large |
| Sensitivity to $R$ | none | none | none |
| Sensitivity to increased training for 1 middle node | saliency ratio increases | none | saliency ratio increases |
| Sensitivity to increased training for > 1 middle node | ‘saliency function’ ratio decreases | ‘saliency function’ ratio decreases | ‘saliency function’ ratio decreases |
| Sensitivity to redundant middle nodes | ‘saliency function’ ratio increases | ‘saliency function’ ratio decreases at high amounts of training | ‘saliency function’ ratio decreases |

Average ‘saliency function’ ratios over 30 runs of trained neural networks using 1 middle node

| Metric | Example 1 | Example 2 | Example 3 |
|---|---|---|---|
| Epochs | 100 | 80 | 100 |
| $\Lambda_i$ | [illegible] | [illegible] | [illegible] |
| $\Lambda_i^{pseudo}$ | [illegible] | [illegible] | [illegible] |
| $\Lambda_i^{random}$ | [illegible] | [illegible] | [illegible] |
| $\Lambda_i^{known}$ | [illegible] | [illegible] | [illegible] |

The first example with non-overlapping classes has the smallest ‘saliency function’ ratios, in general, and the third example with significantly overlapping classes has the largest ‘saliency function’ ratios. This was specifically illustrated for one middle node and 100 epochs in Figure 5. For a minimal network of one middle node, the ‘saliency function’ ratio varies for different amounts of training. In the third example where the two classes are significantly overlapping, there is a definite relationship between training and the ‘saliency function’ ratio for a minimal network. In this example, the ‘saliency function’ ratio increases when the network is trained longer. This relationship does not hold for a network trained with redundant middle nodes.

In the presence of redundant middle nodes, a negative relationship exists between the ‘saliency function’ ratio and additional training. With additional training, the redundant middle nodes begin to incorporate unnecessary information from $x_{noise}$ without affecting the network error rate. As a result, the saliency of $x_{noise}$ increases, which in turn deflates the ‘saliency function’ ratio. This type of behavior is evidence of over-training.

In  the  first  example  where  the  two  classes  are  greatly  separated,  there  is  a  great  deal  of 
flexibility  in  the  function  which  can  effectively  discriminate  between  the  two  classes.  The  minimal 
network  trains  to  a  function  which  is  not  very  steep  in  slope;  however,  a  network  output  function 
with  a  different  slope  would  also  be  effective.  On  the  average,  a  ‘saliency  function’  ratio  of  about  10 
is  produced  by  the  minimal  network  containing  one  middle  node.  Interestingly,  redundant  middle 
nodes  afford  the  flexibility  for  a  different  network  output  function,  so  there  is  an  increase  in  the 
‘saliency  function’  ratio  when  additional  middle  nodes  are  added. 

For the first example, a network with eight middle nodes produces a distorted output function compared to a network with one middle node. The saliency functions shown in Figure 7 are for one and eight middle nodes after five epochs of training. At eight middle nodes, there is graphical evidence that the noise feature is affecting the neural network saliency functions. After about five epochs, the middle nodes begin ‘training’ to the information in $x_{noise}$, which deflates the ‘saliency function’ ratio. Inspection of the weights associated with $x_{noise}$ confirms that the proportional influence of $x_{noise}$ grows with increased training.

For the second and third examples where the two classes are overlapping, the minimal network trains to a function which is very steep. There is not as much flexibility in the ‘choice’ of a network output function which can effectively separate the classes. For these examples, the ‘saliency function’ ratio does not increase in the presence of redundant middle nodes. After sufficient training, the redundant middle nodes begin to incorporate unnecessary information from $x_{noise}$, which makes the ‘saliency function’ ratio between $x_1$ and $x_{noise}$ decrease. This behavior is best seen when looking at the third example for 1000 and 2000 epochs. In the second example, the ‘saliency function’ ratio corresponding to two middle nodes begins to decrease after 5000 epochs of training. However, 5000 epochs is not sufficient to affect the ‘saliency function’ ratio for more than two middle nodes. In the third example, a similar phenomenon also occurs at 500 epochs of training.

The integrated saliency metric and the three approximations, $\Lambda_i^{pseudo}$, $\Lambda_i^{random}$, and $\Lambda_i^{known}$, are summarized in Table 6.

Table 6. Summary of Derivative-Based Saliency Metrics

| Criteria | $\Lambda_i$ | $\Lambda_i^{pseudo}$ | $\Lambda_i^{random}$ | $\Lambda_i^{known}$ |
|---|---|---|---|---|
| Tractability | intractable | tractable | tractable | tractable |
| Sampling regions | includes regions of maximum sensitivity | may include regions of maximum sensitivity | may include regions of maximum sensitivity | may include regions of maximum sensitivity; includes only regions of likelihood |
| Number of evaluations | $R^M$ | $PRM$ | $N$ | $P$ |
| Sensitivity to $R$ | not tested | not significant | not significant | N/A |
| Sensitivity to redundant middle nodes | yes | yes | yes | yes |
| Sensitivity to training | if there are redundant middle nodes | if there are redundant middle nodes | if there are redundant middle nodes | if there are redundant middle nodes |
| Tactical decisions | $R$, middle nodes | $R$, middle nodes | $N$, middle nodes | |
| Performance | good (when tractable) | good | good | good |

3.2.7 Summary. In this section, a framework for derivative-based saliency is presented. Integrated saliency is introduced as a ‘truth model’ for derivative-based saliency. Three tractable approximations for integrated saliency are defined. The approximations were analyzed for sensitivities with respect to sampling, training, and redundant middle nodes.

All of the metrics are sensitive to the number of middle nodes and to the amount of training. It is important to minimize the number of redundant middle nodes, since the saliency metrics are most sensitive to the effects of training in the presence of redundant middle nodes. This is because the redundant middle nodes begin to incorporate unnecessary information from irrelevant features. After sufficient training, the extraneous parameter weights associated with redundant middle nodes will increase, which is indicative of data memorization. If the parameter weights do not increase proportionally with all the features, the saliency results become contaminated.

Although similar results are produced by all of these metrics, the known data metric $\Lambda_i^{known}$ in Equation 31 on page 68 is probably the best choice. This metric is measured in regions where the data is known, and it requires fewer calculations than the integrated and ‘pseudo-sampling’ metrics.

3.3 Relating Derivative and Weight-Based Saliency

In this section, a mathematical connection is shown between the known derivative-based saliency, $\Lambda_i^{known}$, and weight-based saliency. First, a form of the weight-based saliency is defined. Then, the theoretical relationship between $\Lambda_i^{known}$ and the vector of weights emanating from feature $i$, $\mathbf{w}_i^1$, is derived. This relationship is evaluated using ‘saliency function’ ratios for the three examples shown in Figure 5.

3.3.1 Background. A weight-based saliency metric is suggested by Tarr [72:45]. Tarr conceived weight saliency based on the idea that weights connected to important features attain the largest values (absolute values); weights connected to less important features attain smaller values (absolute values); and weights connected to unimportant features would probably attain values somewhere near zero [72:45]. Tarr defined weight saliency as

T-  = 

i
where $w_{i,j}^1$ denotes the $j$th element of $\mathbf{w}_i^1$, or the estimated weight between the $i$th input feature and the $j$th hidden node. Tarr’s definition of weight saliency is based on the Euclidean norm of the estimated weights associated with a feature input. A general definition of weight saliency based on the definition of the $r$th norm of a feature’s estimated weights is given as

$$ T_i^r = \left\| \mathbf{w}_i^1 \right\|_r = \left( \sum_{j=1}^{H} \left| w_{i,j}^1 \right|^r \right)^{1/r} $$

The  effectiveness  of  weight-based  saliency  depends  on  two  things  [72:45]. 

1. $\mathbf{w}_i^1$ must be from a trained neural network of appropriate complexity.

2.  The  input  features  must  be  normalized  to  have  approximately  the  same  ranges. 

Computationally,  this  metric  is  much  simpler  than  other  available  saliency  metrics.  Tarr  presents 
results  which  show  T{  provides  feature  saliency  rankings  similar  to  [72:49]. 
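As a small illustration of the general $r$th-norm form of weight saliency described above, the Python sketch below computes the norm of each feature’s first-layer weights for $r$ equal to one, two, and infinity. The weight matrix is made up for the example and the function name `weight_saliency` is hypothetical; in practice the weights would come from a trained network with normalized inputs.

```python
import numpy as np


def weight_saliency(W1, r=2):
    """r-norm of each feature's first-layer weights; W1 has shape (features, middle nodes)."""
    return np.linalg.norm(W1, ord=r, axis=1)


# Hypothetical weights: row 0 is a salient feature, row 1 a weakly connected one
W1 = np.array([[3.0, -4.0],
               [0.3,  0.4]])

for r in (1, 2, np.inf):
    saliency = weight_saliency(W1, r)
    print(f"r = {r}: saliencies {saliency}, ratio {saliency[0] / saliency[1]:.2f}")
```

Only the absolute values of the weights matter, so the negative weight on the first feature contributes to its saliency in the same way as a positive one.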


3.3.2 Theoretical Relationship. The derivative-based saliency, $\Lambda_i^{known}$, is defined as a function of the estimated weight parameters and the known training data. The estimated weight parameters are defined as a function of the known training data used to train the network. The derivative-based saliency, $\Lambda_i^{known}$, and the estimated weight parameters used to define weight saliency are therefore interrelated. In this section, an upper bound is derived for $\Lambda_i^{known}$ which relates these quantities. The upper bound of $\Lambda_i^{known}$ for any feature $i$ is the vector product of a constant vector times a vector containing the absolute values of the estimated weight parameters associated with feature $i$ [70].

The term $\frac{\partial z_k}{\partial x_i}(\mathbf{x}, \mathbf{w})$ of $\Lambda_i^{known}$ defined in Equation 31 on page 68 can be expanded as
j=i


where  deimitions  of  S},  xj,  z^,  and  wfj  are  reviewed  in  Section  2  of  this  chapter.  Using  this 
expression,  is  defined  as 


p  K 


a?*“  =  p-^EE 


]>=lk=l 


H 


i=i 


(33) 


The theoretical relationship between $\Lambda_i^{known}$ and the estimated weight parameters associated with feature $i$ is developed by expanding $\Lambda_i^{known}$ about the $K$ output nodes and then about the $H$ middle nodes.

An inequality sign replaced the equality sign for Equation 34 when the triangle inequality [38:92] was used to decouple the outputs. For a given input exemplar $p$ and middle node $j$, the result in the brackets $\{\cdot\}$ prior to each $|w_{i,j}^1|$ in Equation 36 is the same regardless of which feature is being examined. Therefore, a constant $\xi_j^p$ is substituted into Equation 36 for the quantity in the brackets $\{\cdot\}$ prior to each $|w_{i,j}^1|$, giving

$$ \Lambda_i^{known} \le P^{-1} \sum_{p=1}^{P} \left( \xi_1^p \left| w_{i,1}^1 \right| + \cdots + \xi_H^p \left| w_{i,H}^1 \right| \right) \tag{37} $$

Now, since $|w_{i,j}^1|$ is independent of $p$, replacing $\sum_{p=1}^{P} \xi_j^p$ with a new constant $\xi_j$ and rearranging terms gives

$$ \Lambda_i^{known} \le P^{-1} \xi_1 \left| w_{i,1}^1 \right| + \cdots + P^{-1} \xi_H \left| w_{i,H}^1 \right| \tag{38} $$

Now, let $|\mathbf{w}_i^1|$ be an $H$-dimensional vector containing the absolute values of the weights associated with the $i$th feature (i.e. $|\mathbf{w}_i^1| = (|w_{i,1}^1|, \cdots, |w_{i,H}^1|)'$), and let $\boldsymbol{\xi}$ be an $H$-dimensional vector of constants associated with the middle nodes (i.e. $\boldsymbol{\xi} = [P^{-1}\xi_1, \cdots, P^{-1}\xi_H]'$). The vector of constants $\boldsymbol{\xi}$ is independent of $i$. Therefore, the known derivative-based saliency for the $i$th feature is bounded above by a constant linear combination of the vector $|\mathbf{w}_i^1|$. That is,

$$ \Lambda_i^{known} \le \boldsymbol{\xi}' \left| \mathbf{w}_i^1 \right| \tag{39} $$

3.3.3 Analysis. To study this mathematical connection, the weight saliency metric $T_i^r$, for $r$ equal to one, two, and infinity, is compared to the saliency metric $\Lambda_i^{known}$ for sensitivity to training and redundant middle nodes. Figure 8 shows average ‘saliency function’ ratio results over 30 neural networks for the three two-class multivariate examples summarized in Figure 5. The networks were trained for 100, 80, and 100 epochs, respectively. The x-axis represents the number of middle nodes and the y-axis corresponds to the ‘saliency function’ ratio of $x_1$ to $x_{noise}$. For each experiment, the average ‘saliency function’ ratio for known data is plotted against the average ‘saliency function’ ratios for weight saliency.

Figure 8. Summary of Derivative versus Weight Saliency

In Figure 8, it can be seen that the weight-based metric produces approximately the same ratio as the known data saliencies when one middle node is used. This occurs for two reasons:

1. The metric $\Lambda_i^{known}$ is bounded above by $\boldsymbol{\xi}'|\mathbf{w}_i^1|$, where $|\mathbf{w}_i^1|$ is an $H$-dimensional vector of the absolute values of the weights from feature $i$ to the $H$ middle nodes, and $\boldsymbol{\xi}$ is an $H$-dimensional vector of constants associated with the $H$ middle nodes.

2. Only one middle node is used to train the neural network. Therefore, the constant term $\boldsymbol{\xi}$ in $\boldsymbol{\xi}'|\mathbf{w}_i^1|$ cancels when a ratio is taken. When there is just one middle node, the ratio of $\Lambda_1^{known}$ to $\Lambda_{noise}^{known}$ is equal to the ratio of $|\mathbf{w}_1^1|$ to $|\mathbf{w}_{noise}^1|$, since the triangle inequality in Equation 34 is not needed.
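Both reasons can be checked numerically. The Python sketch below is an illustration under stated assumptions rather than the original experiment: the network has one layer of sigmoid middle nodes and sigmoid outputs, the weights are random stand-ins, and the names `known_saliency_and_bound` and `xi` (for the vector of constants associated with the middle nodes) are hypothetical.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def known_saliency_and_bound(X, W1, b1, W2, b2):
    """Return the known-data saliency per feature and its upper bound sum_j xi_j |W1[i, j]|."""
    hidden = sigmoid(X @ W1 + b1)                    # (P, H)
    Z = sigmoid(hidden @ W2 + b2)                    # (P, K)
    # c[p, k, j] multiplies W1[i, j] in dz_k/dx_i; it does not depend on the feature i
    c = (Z * (1 - Z))[:, :, None] * W2.T[None, :, :] * (hidden * (1 - hidden))[:, None, :]
    J = c @ W1.T                                     # (P, K, M): dz_k/dx_i
    saliency = np.abs(J).sum(axis=1).mean(axis=0)    # average of sum_k |dz_k/dx_i| over the data
    xi = np.abs(c).sum(axis=1).mean(axis=0)          # one constant per middle node, (H,)
    return saliency, np.abs(W1) @ xi


rng = np.random.default_rng(2)
X = rng.uniform(0.0, 1.0, size=(400, 2))             # stand-in known data, two features


def random_net(n_hidden, K=2, M=2):
    return (rng.normal(size=(M, n_hidden)), rng.normal(size=n_hidden),
            rng.normal(size=(n_hidden, K)), rng.normal(size=K))


W1, b1, W2, b2 = random_net(1)                       # one middle node
sal_1, bound_1 = known_saliency_and_bound(X, W1, b1, W2, b2)
print("H = 1 saliency ratio:", sal_1[0] / sal_1[1],
      " weight ratio:", abs(W1[0, 0]) / abs(W1[1, 0]))

W1_3, b1_3, W2_3, b2_3 = random_net(3)               # three middle nodes
sal_3, bound_3 = known_saliency_and_bound(X, W1_3, b1_3, W2_3, b2_3)
print("H = 3 saliency:", sal_3, " bound:", bound_3)
```

With one middle node the bound is attained and the two printed ratios agree; with three middle nodes the saliency can only be at or below the bound, because contributions from different middle nodes may partially cancel inside the absolute value.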

For more than one middle node, the ratio of the weight saliencies is always smaller than the ratio of the derivative-based saliency. With additional middle nodes the network is over-parameterized. As middle nodes are added, the weights associated with $x_{noise}$ increase faster than the weights associated with $x_1$. Due to the over-parameterization, the parameters between $x_{noise}$ and the redundant middle nodes incorporate unnecessary information about the feature $x_{noise}$. This behavior can be associated with over-training. The sensitivity of the ‘saliency function’ ratios to increased middle nodes is revisited in Section 3.5 for a ‘real world’ problem. A final observation is that the similarity in the saliency rankings and ‘saliency function’ ratios of the collection of weight-based saliencies indicates that the choice of $r$ for $T_i^r$ makes no appreciable difference.

3.3.4 Summary. The theoretical relationship between $\Lambda_i^{known}$ and $|\mathbf{w}_i^1|$ provides a mathematical connection between the metrics $\Lambda_i^{known}$ and $T_i^r$:

• $\Lambda_i^{known}$ is bounded above by a vector product of a constant vector and the vector $|\mathbf{w}_i^1|$ (i.e. by a linear combination of $|\mathbf{w}_i^1|$)

• $T_i^r$ corresponds to the $r$th norm of $\mathbf{w}_i^1$ (or $|\mathbf{w}_i^1|$)

An analysis of this relationship shows that the relative weight-based saliencies are equal to the relative derivative-based saliencies for neural networks with one middle node. In the presence of additional redundant middle nodes, empirical results encompassing a range of two-class multivariate examples (from class distributions not overlapping to class distributions significantly overlapping) indicate that the relative weight-based saliencies are smaller than the relative derivative-based saliencies.

3.4 $P_e$-based Feature Saliency Metrics

In this section, neural network $P_e$-based feature saliency metrics are discussed. These metrics are appropriate for classification problems, but not for regression problems. Probability of error metrics are developed for feedforward neural networks which approximate a Bayesian optimal discriminant. The assumptions necessary for this approximation are discussed in Chapter I. There are several contributions in the area of metrics presented in this section.

In this section, the background on $P_e$-based feature saliency metrics is reviewed. Then, an exact relationship is shown between a $P_e$-based metric and Ruck’s derivative-based metric $\hat{\Lambda}_i$. Next, a new $P_e$-based neural network feature metric $\Gamma_i$ is defined using a restricted subset of the terms associated with that $P_e$-based metric. Two results related to $\Gamma_i$ are derived: first, the relationship between $\Gamma_i$ and $\Lambda_i^{known}$, and second, an upper bound for $\Gamma_i$. Analysis is presented which compares the metrics $\Gamma_i$, $\Lambda_i^{known}$, and $\Lambda_i^{pseudo}$ for both a two class and a four class problem. Finally, the results presented in this section are summarized.

3.4.1 Background. The $P_e$ feature evaluation metric is commonly used whenever the goal is minimizing classifier error rate with features of equal measurement cost. As a result, probability of error is often used as a benchmark for independently measuring the classification error associated with using either a single feature or a set of features for classification neural networks [51, 59, 60].

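The $P_e$ metric defined in Section 2.4.1 page 38 Equation 16 is defined again here for convenience,

$$ P_e(\mathbf{x}) = E_{\mathbf{x}}\left[ 1 - \max\{ P(C_1|\mathbf{x}), \cdots, P(C_K|\mathbf{x}) \} \right], \tag{40} $$

where $\mathbf{x}$ is the vector of features for which $P_e$ is measured, $E_{\mathbf{x}}[\,\cdot\,]$ is the expectation operator, and $P(C_k|\mathbf{x})$ is the posterior probability of class $k$ for $\mathbf{x}$, defined as

$$ P(C_k|\mathbf{x}) = \frac{P(C_k)\,P(\mathbf{x}|C_k)}{\sum_{j=1}^{K} P(C_j)\,P(\mathbf{x}|C_j)} \tag{41} $$

where $P(C_k)$ is the prior probability of class $k$ and $P(\mathbf{x}|C_k)$ is the class conditional probability function of $\mathbf{x}$ for class $k$.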
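As a worked illustration of these two definitions, the Python sketch below evaluates the posterior probabilities and the probability of error for a univariate two-class problem on a dense grid. The class distributions (unit-variance Gaussians centered at 0 and 2 with equal priors), the grid, and the function names are assumptions chosen for the example; they are not the examples used in this chapter.

```python
import math

import numpy as np


def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def bayes_error(x, priors, class_pdfs):
    """Approximate E_x[1 - max_k P(C_k|x)] on an evenly spaced grid x; also return the posteriors."""
    joint = np.stack([p * pdf for p, pdf in zip(priors, class_pdfs)])  # P(C_k) P(x|C_k), shape (K, n)
    px = joint.sum(axis=0)                                             # total density of x
    posterior = joint / px                                             # Bayes' rule, P(C_k|x)
    integrand = (1.0 - posterior.max(axis=0)) * px                     # expectation taken over x
    dx = x[1] - x[0]
    return float(integrand.sum() * dx), posterior


x = np.linspace(-8.0, 10.0, 18001)
priors = (0.5, 0.5)
class_pdfs = (gaussian_pdf(x, 0.0, 1.0), gaussian_pdf(x, 2.0, 1.0))

p_e, posterior = bayes_error(x, priors, class_pdfs)
print(f"estimated probability of error: {p_e:.4f}")
```

For this particular pair of classes the decision border lies at $x = 1$, so the estimate should be close to the standard normal tail probability $\Phi(-1) \approx 0.1587$.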

By definition, $\sum_{k=1}^{K} P(C_k|\mathbf{x}) = 1$. Also, the class specific probability of error associated with the $k$th class, $P_e(C_k, \mathbf{x})$, is given as

$$ P_e(C_k, \mathbf{x}) = 1 - P(C_k|\mathbf{x}) \tag{42} $$

$$ P_e(C_k, \mathbf{x}) = \sum_{l \ne k} P(C_l|\mathbf{x}) \tag{43} $$

Under certain necessary conditions, the feedforward neural network approximates a Bayes optimal discriminant function in the limit (see Section 1.2 of Chapter I). The implication of interpreting a trained feedforward neural network in the limit as an approximation for a Bayes optimal discriminant function is that the classical definitions associated with probability of error given in Equations 40, 41, and 42 can be redefined as approximations in neural network terms. Specifically, using the fact that $z_k(\mathbf{x}, \mathbf{w}) \approx P(C_k|\mathbf{x})$, the neural network approximations to the class specific probability of error $P_e(C_k, \mathbf{x})$ and classifier probability of error $P_e(\mathbf{x})$ are defined in Equations 45 and 49, respectively.

Priddy illustrates a relationship between the class specific probability of error $P_e(C_k, \mathbf{x})$ and Ruck’s derivative-based feature saliency metric $\hat{\Lambda}_i$ defined in Equation 29 [51]. This relationship relies on the assumptions necessary for feedforward neural networks to approximate a Bayes optimal discriminant function in the limit. Priddy defines a Bayesian-based saliency metric, $\Omega_i$, as

$$ \Omega_i = \sum_{p=1}^{P} \sum_{m=1}^{M} \sum_{r=1}^{R} \sum_{k=1}^{K} \left| \frac{\partial P_e(C_k, \mathbf{x}^p_{m(r)}, \mathbf{w})}{\partial x_i} \right| \tag{44} $$

where $P$ is the number of training vectors $\mathbf{x}$; $M$ is the number of features; $R$ is the number of uniformly spaced points covering the range of each input feature found in the training set; $K$ is the number of output classes; the vector $\mathbf{x}^p_{m(r)}$ is the $p$th exemplar $\mathbf{x}^p$ with its $m$th component replaced by $d_r$, the $r$th component of $\mathbf{d}_m$ defined in Equation 28; $(\mathbf{x}^p_{m(r)}, \mathbf{w})$ indicates that the derivative is evaluated with the feature vector $\mathbf{x}^p_{m(r)}$ and the final estimates of the trained network weight parameters $\mathbf{w}$; and $P_e(C_k, \mathbf{x}, \mathbf{w})$ is a neural network approximation to the class specific probability of error.


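Priddy defines the neural network approximation for class specific probability of error $\hat{P}_e(C_k,\mathbf{x},\mathbf{w})$ as

\[ \hat{P}_e(C_k,\mathbf{x},\mathbf{w}) = \sum_{l \ne k} z_l(\mathbf{x},\mathbf{w}), \tag{45} \]

which is similar to Equation 43. Now, substituting Equation 45 into Equation 44 results in

\[ \Omega_i = \sum_{p=1}^{P} \sum_{m=1}^{M} \sum_{r=1}^{R} \sum_{k=1}^{K} \left| \frac{\partial}{\partial x_i} \sum_{l \ne k} z_l(\mathbf{x}_m^p(r), \mathbf{w}) \right|. \tag{46} \]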


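Using the triangle inequality, Priddy proves that $\Omega_i$ is bounded above by a simplified saliency metric $\tilde{\Omega}_i$ [51], i.e.

\[ \Omega_i \le \tilde{\Omega}_i, \tag{47} \]

where

\[ \tilde{\Omega}_i = \sum_{p=1}^{P} \sum_{m=1}^{M} \sum_{r=1}^{R} \sum_{k=1}^{K} \sum_{l \ne k} \left| \frac{\partial z_l(\mathbf{x}_m^p(r), \mathbf{w})}{\partial x_i} \right|. \]

The Bayesian-based metric $\tilde{\Omega}_i$ is related to the metric $\Lambda_i$, since it involves the partials of $z_k$ with respect to $x_i$ and the pseudo-sampling of unknown vectors from the feature space [59]. Priddy shows $\tilde{\Omega}_i$ is a scalar multiple of the $\Lambda_i$ in Equation 29:

\[ \tilde{\Omega}_i = (K-1)\Lambda_i, \tag{48} \]

where $K$ is the total number of output classes [51]. Therefore, the two saliency metrics, $\tilde{\Omega}_i$ and $\Lambda_i$, produce identical feature rankings.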

3.4.2 Equality of Two $P_e$-based Metrics. In this section, it is shown that the metric $\Omega_i$ is exactly equal to the metric $\Lambda_i$. Using the relationship shown in Equation 42, a neural network approximation to class specific probability of error can also be defined as

\[ \hat{P}_e(C_k,\mathbf{x},\mathbf{w}) = 1 - z_k(\mathbf{x},\mathbf{w}). \tag{49} \]


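Now substituting Equation 49 into Equation 44 and simplifying illustrates the equality of $\Omega_i$ and $\Lambda_i$:

\begin{align}
\Omega_i &= \sum_{p=1}^{P} \sum_{m=1}^{M} \sum_{r=1}^{R} \sum_{k=1}^{K} \left| \frac{\partial \hat{P}_e(C_k, \mathbf{x}_m^p(r), \mathbf{w})}{\partial x_i} \right| \nonumber \\
&= \sum_{p=1}^{P} \sum_{m=1}^{M} \sum_{r=1}^{R} \sum_{k=1}^{K} \left| \frac{\partial \left[ 1 - z_k(\mathbf{x}_m^p(r), \mathbf{w}) \right]}{\partial x_i} \right| \nonumber \\
&= \sum_{p=1}^{P} \sum_{m=1}^{M} \sum_{r=1}^{R} \sum_{k=1}^{K} \left| \frac{\partial z_k(\mathbf{x}_m^p(r), \mathbf{w})}{\partial x_i} \right| \nonumber \\
&= \Lambda_i. \tag{50}
\end{align}

The relationship between $\Omega_i$ and $\Lambda_i$ is derived exactly without recourse to $\tilde{\Omega}_i$ of Equation 47 [51].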
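The equality in Equation 50 and the bound in Equations 47 and 48 can be checked numerically. The sketch below is only illustrative: the small network, its random weights, and the softmax output layer (used here to force the outputs to sum to one) are assumptions, and random points stand in for the pseudo-sampled vectors $\mathbf{x}_m^p(r)$.

```python
import numpy as np

rng = np.random.default_rng(0)
M, H, K = 3, 4, 4                      # features, middle nodes, classes (hypothetical sizes)
W1, b1 = rng.normal(size=(H, M)), rng.normal(size=H)
W2, b2 = rng.normal(size=(K, H)), rng.normal(size=K)

def outputs(x):
    """Softmax outputs z_k(x, w); softmax guarantees sum_k z_k = 1."""
    h = np.tanh(W1 @ x + b1)
    a = W2 @ h + b2
    e = np.exp(a - a.max())
    return e / e.sum()

def jacobian(x, eps=1e-6):
    """Central-difference estimate of dz_k/dx_i, shape (K, M)."""
    J = np.zeros((K, M))
    for i in range(M):
        step = np.zeros(M)
        step[i] = eps
        J[:, i] = (outputs(x + step) - outputs(x - step)) / (2 * eps)
    return J

points = rng.normal(size=(50, M))      # stand-in for the vectors x_m^p(r)
lam = np.zeros(M)                      # Lambda_i: sum of |dz_k/dx_i|
omega_49 = np.zeros(M)                 # Omega_i using Equation 49
omega_45 = np.zeros(M)                 # Omega_i using Equation 45
omega_bound = np.zeros(M)              # upper bound of Equation 47
for x in points:
    J = jacobian(x)
    absJ = np.abs(J)
    lam += absJ.sum(axis=0)
    omega_49 += np.abs(-J).sum(axis=0)                      # d(1 - z_k)/dx_i
    omega_45 += np.abs(J.sum(axis=0) - J).sum(axis=0)       # d(sum_{l != k} z_l)/dx_i
    omega_bound += (absJ.sum(axis=0) - absJ).sum(axis=0)    # sum_k sum_{l != k} |dz_l/dx_i|
```

With outputs that sum to one, `omega_45` and `omega_49` both agree with `lam`, while `omega_bound` equals `(K - 1) * lam`, so the bound is loose by a factor of $K-1$ but preserves the feature ranking.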


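3.4.3 Derivation of a New $P_e$-based Metric. The definition of $P_e$ reviewed in Equation 40 is used to derive a new Bayesian-based saliency metric. This metric is related closely to the neural network approximation to the Bayesian classification error $P_e(\mathbf{x})$, which is defined in neural network terms as

\begin{align}
\hat{P}_e(\mathbf{x},\mathbf{w}) &= P^{-1} \sum_{p=1}^{P} \hat{P}_e(\mathbf{x}^p,\mathbf{w}) \tag{51} \\
&= P^{-1} \sum_{p=1}^{P} \left[ 1 - \max\{z_1(\mathbf{x}^p,\mathbf{w}), \ldots, z_K(\mathbf{x}^p,\mathbf{w})\} \right], \tag{52}
\end{align}

where $\hat{P}_e(\mathbf{x}^p,\mathbf{w})$ is the probability of error associated with the $p$th exemplar from a set of $P$ total exemplars. The new Bayesian-based saliency metric for the $i$th feature is defined using $\hat{P}_e(\mathbf{x}^p,\mathbf{w})$:

\[ \Gamma_i = P^{-1} \sum_{p=1}^{P} \left| \frac{\partial \hat{P}_e(\mathbf{x}^p,\mathbf{w})}{\partial x_i} \right|. \tag{53} \]

Like $\Lambda_i^{data}$, this metric depends only on the known data. Let $z_{k_{max}}(\mathbf{x}^p,\mathbf{w})$ be a function which is given as

\[ z_{k_{max}}(\mathbf{x}^p,\mathbf{w}) = \max\{z_1(\mathbf{x}^p,\mathbf{w}), \ldots, z_K(\mathbf{x}^p,\mathbf{w})\}, \]

where $k_{max}$ represents the subscript $k$ associated with the $\max\{z_1(\mathbf{x}^p,\mathbf{w}), \ldots, z_K(\mathbf{x}^p,\mathbf{w})\}$. Using $z_{k_{max}}(\mathbf{x}^p,\mathbf{w})$, $\hat{P}_e(\mathbf{x}^p,\mathbf{w})$ in Equation 53 becomes

\[ \hat{P}_e(\mathbf{x}^p,\mathbf{w}) = 1 - z_{k_{max}}(\mathbf{x}^p,\mathbf{w}), \tag{54} \]

giving:

\[ \Gamma_i = P^{-1} \sum_{p=1}^{P} \left| \frac{\partial z_{k_{max}}(\mathbf{x}^p,\mathbf{w})}{\partial x_i} \right|. \]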
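A minimal sketch of how $\Gamma_i$ could be computed from per-exemplar outputs and derivatives follows. The array layout, the function names, and the synthetic inputs are hypothetical. The form used for $\Lambda_i^{data}$ (a mean over exemplars of the summed absolute derivatives) is an assumption about Equation 31, chosen because it is consistent with the two class result discussed in this section.

```python
import numpy as np

def gamma_saliency(Z, J):
    """Gamma_i = P^-1 sum_p |dz_kmax(x^p, w)/dx_i|.

    Z: (P, K) network outputs; J: (P, K, M) derivatives dz_k/dx_i per exemplar.
    """
    k_max = Z.argmax(axis=1)                      # winning output per exemplar
    winners = J[np.arange(len(Z)), k_max, :]      # (P, M) derivatives of z_kmax
    return np.abs(winners).mean(axis=0)

def lambda_data(J):
    """Assumed form of the data-based saliency: P^-1 sum_p sum_k |dz_k/dx_i|."""
    return np.abs(J).sum(axis=1).mean(axis=0)

def synthetic_case(P, K, M, seed):
    """Hypothetical outputs and derivatives that respect sum_k z_k = 1."""
    rng = np.random.default_rng(seed)
    Z = rng.dirichlet(np.ones(K), size=P)
    J = rng.normal(size=(P, K, M))
    J -= J.mean(axis=1, keepdims=True)            # force sum_k dz_k/dx_i = 0
    return Z, J

Z2, J2 = synthetic_case(P=200, K=2, M=3, seed=1)  # two class case
Z4, J4 = synthetic_case(P=200, K=4, M=3, seed=2)  # four class case
```

Under these assumptions, `gamma_saliency(Z2, J2)` equals half of `lambda_data(J2)`, while for four classes `gamma_saliency(Z4, J4)` does not exceed half of `lambda_data(J4)`.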


3.4.4 Theoretical Relationships. The relationship between $\Gamma_i$ and the derivative-based saliency $\Lambda_i^{data}$ is derived from the definition of $\Lambda_i^{data}$:

In summary, the exact relationship between $\Gamma_i$ and $\Lambda_i^{data}$ is

(55)

It can also be shown that $\frac{1}{2}\Lambda_i^{data}$ is an upper bound for $\Gamma_i$. This relationship is derived using the triangle inequality on the second term of Equation 55 in concert with the Bayesian relationship that $\sum_{k=1}^{K} z_k(\mathbf{x},\mathbf{w}) = 1$ as follows:

(56)


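For a two class problem, the new metric is at its upper bound exactly. That is, $\Gamma_i = \frac{1}{2}\Lambda_i^{data}$, since

\[ \sum_{p=1}^{P} \sum_{k \ne k_{max}} \left| \frac{\partial z_k(\mathbf{x}^p,\mathbf{w})}{\partial x_i} \right| = \sum_{p=1}^{P} \left| \frac{\partial z_{k_{max}}(\mathbf{x}^p,\mathbf{w})}{\partial x_i} \right| \]

when $K = 2$. This means that for a two class problem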


5zjk(x'’,w) 
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For  more  than  a  two  class  problem,  the  metric  Fj  will  be  at  its  upper  bound  only  if  the 
partial derivatives for $k \ne k_{max}$ are all the same sign. To analyze the partial derivatives,
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Equation  25  is  shown  again  below  for  convenience  as 


•  «*>0 
•  «,•  >  0 

•  are  constants 


Therefore, it is the middle node to output weights $w_{jk}$ which influence whether the partial derivatives will be the same sign for all outputs. For a net with just one middle node, all the derivatives will be the same sign if the weights are the same sign for all $k \ne k_{max}$.
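The relationships numbered 55 and 56 can be sketched as follows. This is a reconstruction under a stated assumption, not the original derivation: it assumes that $\Lambda_i^{data}$ in Equation 31 has the form $P^{-1}\sum_p\sum_k |\partial z_k(\mathbf{x}^p,\mathbf{w})/\partial x_i|$, which is consistent with the two class result $\Gamma_i = \frac{1}{2}\Lambda_i^{data}$.

```latex
\begin{align*}
\Lambda_i^{data}
  &= P^{-1}\sum_{p=1}^{P}\left|\frac{\partial z_{k_{max}}(\mathbf{x}^p,\mathbf{w})}{\partial x_i}\right|
   + P^{-1}\sum_{p=1}^{P}\sum_{k\ne k_{max}}\left|\frac{\partial z_k(\mathbf{x}^p,\mathbf{w})}{\partial x_i}\right| \\
\Rightarrow\quad \Gamma_i
  &= \Lambda_i^{data}
   - P^{-1}\sum_{p=1}^{P}\sum_{k\ne k_{max}}\left|\frac{\partial z_k(\mathbf{x}^p,\mathbf{w})}{\partial x_i}\right| . \\
\intertext{Since $\sum_k z_k(\mathbf{x},\mathbf{w}) = 1$, the derivatives sum to zero, so by the triangle inequality}
\left|\frac{\partial z_{k_{max}}(\mathbf{x}^p,\mathbf{w})}{\partial x_i}\right|
  &= \left|\sum_{k\ne k_{max}}\frac{\partial z_k(\mathbf{x}^p,\mathbf{w})}{\partial x_i}\right|
  \le \sum_{k\ne k_{max}}\left|\frac{\partial z_k(\mathbf{x}^p,\mathbf{w})}{\partial x_i}\right| . \\
\intertext{Averaging over the $P$ exemplars and substituting into the expression for $\Gamma_i$ gives}
\Gamma_i &\le \Lambda_i^{data} - \Gamma_i
  \quad\Longrightarrow\quad \Gamma_i \le \tfrac{1}{2}\Lambda_i^{data} .
\end{align*}
```

Equality holds when, for every exemplar, the partial derivatives for $k \ne k_{max}$ all share the same sign, which is always the case when $K = 2$ because only one such term exists.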


3.4.5 Analysis. The upper bound for $\Gamma_i$ is investigated empirically with a two class and a four class problem. For the two class problem, the exclusive-or (XOR) problem illustrated in Figure 9 is used; it is a standard benchmark problem for neural networks. In this nonlinear classification problem, no single line can be drawn to separate class 1 and class 2 regions. Five hundred data points are randomly generated for the XOR problem. A training set of 400 and a training-test set of 100 are used.
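The data generation for the XOR problem could be sketched as below. The sampling distribution is not stated in this section, so uniform sampling on the unit square with class boundaries at 0.5 is an assumption, as are the function name and the seed.

```python
import numpy as np

def make_xor_data(n=500, n_train=400, seed=0):
    """Random XOR data; uniform sampling on the unit square is an assumption."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n, 2))
    # Class 1 when exactly one coordinate exceeds 0.5, class 2 otherwise.
    y = np.where((X[:, 0] > 0.5) ^ (X[:, 1] > 0.5), 1, 2)
    order = rng.permutation(n)
    train, test = order[:n_train], order[n_train:]
    return (X[train], y[train]), (X[test], y[test])

(train_X, train_y), (test_X, test_y) = make_xor_data()
```

The split mirrors the 400 training and 100 training-test vectors described above.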

Saliency metric results for $\frac{1}{2}\Lambda_i^{data}$ and $\Gamma_i$ are summarized for 30 ‘trained’ neural networks which were trained with the same data set, but with different random initial weights and a different random order of training vector presentation. The neural networks used four middle nodes. Log-linear declining learning rates were used to improve the neural network’s convergence to a solution. For all runs, 700 epochs were used. Also, a training set classification error of at most seven percent was required for the network’s solution to be considered from a ‘trained’ network. In total, 51 networks were trained; 21 networks did not meet the seven percent requirement. The average training and test set classification errors for the remaining 30 neural networks were 1.59 and 2.96 percent, respectively. The saliency was computed for the 30 networks at 700 epochs. As expected, the metric $\Gamma_i$ (to within roundoff error) is exactly equal to its upper bound of $\frac{1}{2}\Lambda_i^{data}$. This is demonstrated with the XOR problem in Table 7.

Figure 9. The XOR Problem (reprinted from [5])

Table 7. XOR Problem: Saliency Metric Means for 30 Trained Networks

Feature    $\frac{1}{2}\Lambda_i^{data}$    $\Gamma_i$
x          1.001                            1.001
y          1.078                            1.078
bias       0.84                             0.84

(training data used)

Table 8. Four Class Problem: Saliency Metric Means for 30 Trained Networks

Feature    $\frac{1}{2}\Lambda_i^{data}$    $\Gamma_i$
x          0.879                            0.843
y          0.567                            0.606
bias       0.360                            0.387

(training data used)

The metric $\Gamma_i$ is generally less than its upper bound when used with more than a two class problem. A four class problem with two variables is studied. The classes are multivariate normally distributed with an identity matrix for the covariance matrix. The mean vectors for the four classes are: (4.5, 2.17), (2.0, 6.5), (7.0, 6.5), and (12.0, 6.5). Five hundred data vectors are randomly generated for this problem: 400 for the training set and 100 for the training-test set. Again, saliency metric results are summarized for 30 ‘trained’ neural networks trained with the same data set. For all runs, a minimal network of two middle nodes (determined from a number of pilot simulations), a log-linear declining learning rate, and 500 epochs were used. Also, a training set classification error of at most five percent was required for the network’s solution to be considered from a ‘trained’ network. For this problem, 30 networks were trained, and all the networks met the five percent requirement. The average training and test set classification errors for the 30 neural networks were 2.06 and 5.77 percent, respectively. The saliency was computed for the 30 networks at 500 epochs. The metrics $\frac{1}{2}\Lambda_i^{data}$ and $\Gamma_i$ are shown in Table 8.
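The four class data set described above could be generated along the following lines. The mean vectors and identity covariance come from the text; equal class sizes of 125 vectors each, the function name, and the seed are assumptions, since only the total of 500 vectors is given.

```python
import numpy as np

# Class mean vectors taken from the text.
MEANS = np.array([[4.5, 2.17], [2.0, 6.5], [7.0, 6.5], [12.0, 6.5]])

def make_four_class_data(n=500, n_train=400, seed=0):
    """Four bivariate normal classes with identity covariance.

    Equal class sizes (125 each) are an assumption; the text gives only the total.
    """
    rng = np.random.default_rng(seed)
    labels = np.repeat(np.arange(4), n // 4)
    X = MEANS[labels] + rng.standard_normal((n, 2))   # identity covariance
    order = rng.permutation(n)
    train, test = order[:n_train], order[n_train:]
    return (X[train], labels[train]), (X[test], labels[test])

(train_X, train_y), (test_X, test_y) = make_four_class_data()
```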

3.4.6 Summary. Neural network Bayesian-based feature saliency metrics are covered in this section. These metrics are developed for use with neural networks which approximate a Bayesian optimal discriminant in the limit, and they are only appropriate for use with classification problems. The research contributions in this section are:

•  An exact relationship is shown between Ruck’s metric $\Lambda_i$ and the Bayesian-based metric $\Omega_i$ suggested by Priddy.

•  A new Bayesian-based neural network feature metric $\Gamma_i$ is defined using only a subset of the terms in $\Lambda_i^{data}$.

•  The relationship between $\Gamma_i$ and $\Lambda_i^{data}$ is derived.

•  An upper bound for $\Gamma_i$ is derived.

For classification applications, the metric $\Gamma_i$ is more appealing than $\Lambda_i^{data}$. The metric $\Gamma_i$ is developed from classifier error $\hat{P}_e(\mathbf{x},\mathbf{w})$ (see Equation 51 on page 86), rather than class specific error $\hat{P}_e(C_k,\mathbf{x},\mathbf{w})$ (see Equation 45 on page 85). Also, the saliency metric $\Gamma_i$ is computed using only a subset of the terms used for $\Lambda_i^{data}$, making the definition of $\Gamma_i$ more succinct than the definition of $\Lambda_i^{data}$.

3.5 Unifying Theoretical Relationships

3.5.1  Introduction.  This  section  documents  the  relationships  between  the  set  of  available 
neural  network  feature  saliency  metrics.  Neural  network  feature  saliency  metrics  include  the  es¬ 
tablished  metrics  introduced  in  Chapter  n  and  three  new  metrics  defined  in  this  chapter:  Aj, 
and  Fj. 

The  derivative  and  weight-based  feature  saliency  metrics  are  defined  in  Table  9.  Formal 
definitions  and  applicable  references  are  presented  for  each  metric.  With  the  exception  of  the 
weight-based saliencies, each of the feature saliency metrics is defined in Table 9 as a function of
Qi.  The  metrics  differ  in  the  derivative  that  is  taken  and  in  the  definition  of  p,-  which  is  used. 

The  function  Qi,  also  given  in  Table  9  can  be  interpreted  as  the  absolute  value  of  some  form 
of  a  derivative  of  a  network  error  function.  For  the  metrics  A^,  A{,  Aj**‘,  and  Oj,  the  network 
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error  function  is  defined  —  Zk(x,w),  and  the  derivative  is  taken  with  respect  to  the  feature  of 
interest  For  the  metric  pi,  the  network  error  function  is  the  same,  but  the  derivative  is  taken 
with  respect  to  a  relevance  function  for  the  feature  of  interest.  For  the  metric  the  network 
error function is the approximate probability of error for the $k$th network output for the $p$th input exemplar, i.e. $\hat{P}_e(C_k,\mathbf{x}^p,\mathbf{w})$, and the derivative is taken with respect to the feature of interest $x_i$. For the metric $\Gamma_i$, the network error function is the approximate probability of error for the network given the $p$th input exemplar, i.e. $\hat{P}_e(\mathbf{x}^p,\mathbf{w})$, and the derivative is taken with respect to the feature of interest $x_i$. For the metric $s_i$, the error function is the squared error defined as $[d_k - z_k(\mathbf{x},\mathbf{w})]^2$, and here a form of the Taylor Series approximation to the total derivative is used. In Table 10, detailed notational saliency definitions, as well as established theoretical relationships among the saliency metrics, are presented.

Since previous examples have been contrived problems, a ‘real world’ problem is analyzed in this section. A description of this problem is followed by a comparison and evaluation of the various feature saliency metrics.

3.5.2 Background. The ‘real world’ problem is a two class problem using forward looking infrared (FLIR) data to discriminate targets from non-targets. The targets consisted of tanks, trucks, and armored personnel carriers. Nine features were used based on previous application experience by Roggemann [56, 57, 58] and by Ruck [60]. A description of the nine FLIR features is given in Table 11.

3.5.3 Analysis. Analysis of the FLIR problem is discussed in this section. Saliency metric ranks, means, and standard deviations are documented for all of the metrics presented in Table 9, except the integrated metric. Results are not computed for the integrated metric, because it is not computationally tractable for this problem. The average network accuracy is documented as the features are eliminated one-by-one based on the saliency metric rankings. Finally, the set of saliency metrics is factor analyzed to look for any underlying statistical relationships among the different metrics.

Table 10. Detailed Notational Definitions and Relationships

Table 11. Description of FLIR Features Evaluated

Feature Number   Feature               Description
1                Length/Width          Ratio of object length to width
2                Standard Deviation    Standard deviation of pixel values on object
3                Maximum Brightness    Maximum brightness on object
4                Compactness           Ratio of number of pixels on object to number of pixels in rectangle which bounds object
5                Complexity            Ratio of border pixel to total object pixels
6                Mean Contrast         Contrast ratio of object’s mean to local background mean
7                Contrast Ratio        Contrast ratio of object’s highest pixel to its lowest
8                Bright Pixel Ratio    Ratio of number of pixels on object within 10% of maximum brightness to total object pixels
9                Difference of Means   Difference of object and local background means

Adapted from Ruck [60:42]

Saliency results are summarized for 30 ‘trained’ neural networks which were trained with the same data set, but with different random initial weights and a different random order of training vector presentation. A data set of 550 vectors was randomly partitioned for each neural network into training, training-test, and validation sets of size 300, 125, and 125, respectively. The neural networks were trained with four middle nodes for 500 epochs before the saliency was computed. From some pilot runs, four middle nodes seemed to be a minimal network structure for this data, since results were degraded for fewer middle nodes and no significant improvements were realized with additional middle nodes. Log-linear declining learning rates and a momentum rate of 0.30 were used to improve the neural network’s convergence to a local minimum. Also, a training set classification error of at most seven percent was required for the network’s solution to be considered from a ‘trained’ network. Thirty networks were trained, and all 30 networks met the seven percent requirement. The average training and training-test set classification errors for the 30 neural networks were 3.30 and 9.39 percent, respectively.

In Table 12, rankings are shown for the saliency metrics, where a ranking of ‘1’ indicates the best feature and a ranking of ‘9’ indicates the worst feature. With the exception of the metrics $\rho_i$ and $s_i$, similar features receive high rankings, and similar features receive low rankings across the metrics.

The  mean  saliency  and  the  corresponding  standard  error  for  the  30  runs  are  presented  in 
Tables  13  and  14.  Some  further  analysis  is  done  using  ‘saliency  function’  ratios  to  evaluate  the 
relative  saliency  of  one  feature  to  another. 
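The ranking and ‘saliency function’ ratio computations can be illustrated with a short sketch. The function names and the sample values are hypothetical and are not taken from Tables 13 or 14. The ratio function accepts the ranks as a separate argument so that, as in this section, the ranking from one metric can be used to form ratios for another.

```python
import numpy as np

def rank_features(mean_saliency):
    """Return ranks where 1 marks the largest mean saliency (best feature)."""
    order = np.argsort(-np.asarray(mean_saliency, dtype=float))
    ranks = np.empty(len(order), dtype=int)
    ranks[order] = np.arange(1, len(order) + 1)
    return ranks

def saliency_ratio(mean_saliency, ranks):
    """Ratio of the mean saliency of the first-ranked to the last-ranked feature."""
    s = np.asarray(mean_saliency, dtype=float)
    return s[np.argmin(ranks)] / s[np.argmax(ranks)]

# Hypothetical mean saliencies for four features.
means = [0.90, 0.20, 0.45, 0.30]
ranks = rank_features(means)
ratio = saliency_ratio(means, ranks)
```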

Differences in the mean value of the derivative-based metrics presented in Section 3.2 are due to the method used to sample the saliency function $\left|\partial z_k(\mathbf{x},\mathbf{w})/\partial x_i\right|$ when measuring the saliency for the $i$th feature. The relative saliencies of the feature ranked first to the feature ranked last
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Table  12.  FLIR  Problem:  Saliency  Metric  Rank 
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using  rankings  from  A^***  are  4.54,  4.56,  and  4.53,  respectively  for  A?*'"**'’,  AJ"****”,  and  A^.  As 
expected, the metric $\Gamma_i$ (to within roundoff error) is exactly equal to its upper bound of $\frac{1}{2}\Lambda_i^{data}$, since this is a two class problem. Despite differences in sampling and the higher dimensional feature space ($M = 10$), the derivative-based saliencies all perform quite similarly.

Among themselves, the weight saliencies have rankings and ‘saliency function’ ratios which are similar to each other. The relative saliencies of the feature ranked first to the feature ranked last using the rankings from $\Lambda_i^{data}$ are 3.17, 3.45, and 3.52 for the three weight saliencies $\tau_i^1$, $\tau_i^2$, and $\tau_i^\infty$. The similarity in the weight-based saliency rankings and ‘saliency function’ ratios indicates that the choice of $r$ for weight saliency $\tau_i^r$ makes no appreciable difference.

As seen with the examples presented in Section 3.3.3, the ‘saliency function’ ratios associated with the weight-based saliencies are smaller than those experienced with the derivative-based saliencies. Although a minimal number of nodes was used to analyze the FLIR problem, these results indicate that the relative saliencies of important to irrelevant features are degraded when using the weight-based saliency metrics for greater than one middle node. This phenomenon is also documented for redundant middle nodes in Section 3.3.3.

The metrics $\rho_i$ and $s_i$ have similar rankings to each other (six of the nine features are ranked the same), but not to the other saliency metrics documented. It is interesting to look at the relative saliency of the features ranked first and last by both $\rho_i$ and $s_i$. In this case, the respective relative saliencies are 10.34 and 59.78, which are markedly different. The corresponding relative saliencies for the derivative-based metrics are 0.86, 0.89, and 0.82, and the corresponding relative saliencies for the weight-based metrics are 0.92, 0.86, and 0.82. These results indicate that the metrics $\rho_i$ and $s_i$ are fundamentally different from the other saliency metrics.

The saliency feature rankings from Table 12 are used to perform systematic feature elimination and evaluation of the network’s corresponding classification accuracy. The features are eliminated one-by-one based on the saliency metric rankings, i.e. the worst features are removed first. The results from this analysis are shown in Figure 10, where the x-axis corresponds to the number of features which have been removed, and the y-axis corresponds to the network accuracy. Four middle nodes are used for the entire evaluation, although fewer middle nodes might have been more appropriate as more features were eliminated. A log-linear declining learning rate was also employed as before. For each neural network, the data set is partitioned as before into training, training-test, and validation sets of 300, 125, and 125 vectors, respectively. The data-base partitioning and confidence interval procedures described in Chapter I are used to compute the network accuracy. Since the error rate of the training-test set no longer exhibits wide variations after approximately 100 training epochs, 150 epochs were used for each neural network.

In Figure 10, the mean accuracy and a 95 percent confidence interval error band (plotted as horizontal bars about the means) are plotted using the best 10 of 30 neural networks for $\Lambda_i^{data}$. Only a subset of the ‘best’ neural networks is used to compute the means and standard deviations, since backpropagation learning may not converge to a local minimum (see discussion at the end of Section 3 of Chapter I) [80:143]. Although only one of the nine metrics is shown in Figure 10, it is representative of the other metrics. Identical features are retained for many of the saliency metrics at each point in the analysis, so the results from the other metrics are similar.
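The summary statistic plotted for each point could be computed roughly as follows. The confidence interval procedure itself is described in Chapter I and is not reproduced here, so the Student t interval below is an assumption, and the accuracy values are hypothetical.

```python
import math
import statistics

# Two-sided 95% Student t critical value for 9 degrees of freedom.
T_CRIT_95_DF9 = 2.262

def best_subset_interval(accuracies, keep=10, t_crit=T_CRIT_95_DF9):
    """Mean and 95% interval over the `keep` most accurate networks.

    The default critical value only matches keep=10 (9 degrees of freedom).
    """
    best = sorted(accuracies, reverse=True)[:keep]
    mean = statistics.mean(best)
    half_width = t_crit * statistics.stdev(best) / math.sqrt(keep)
    return mean, mean - half_width, mean + half_width

# Hypothetical validation accuracies for 30 networks.
accuracies = [0.80 + 0.005 * i for i in range(30)]
mean, low, high = best_subset_interval(accuracies)
```

Selecting the best networks before averaging biases the mean upward relative to all 30 runs, which is worth keeping in mind when reading the error bands.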

The results from all of the evaluated metrics indicate it is possible to remove one to three features based on saliency metric rankings with little or no degradation in the average network error. However, further reduction of the feature set requires a trade-off in classification error. This is demonstrated in Figure 10 for $\Lambda_i^{data}$. When more than three features are eliminated, there is great variation in the average network performance depending on which saliency metric was used. This is not surprising, since the saliency of the features was measured when all the features were in the network. If the feature saliencies were re-evaluated for the smaller subset of features, the remaining features would be ranked differently in many cases. Nevertheless, the significant point is that all of the metrics are able to rank expendable features last.

Figure 10. Average Validation Set Accuracy on FLIR Problem as Features are Eliminated (x-axis: Number of Features Eliminated from Feature Set)

An exploratory factor analysis using a varimax rotation (see Dillon and Goldstein for details [15]) is done on the correlation matrix of the various saliency metrics. The exploratory factor analysis was done to identify whether some of the saliency metrics are statistically related to each other by some underlying factor. Using the saliency results from thirty neural network simulations, a factor analysis was performed for each feature in the FLIR problem. The results are displayed in Table 15. In all cases, a two factor model was most appropriate.
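For readers unfamiliar with the procedure, a two factor analysis with varimax rotation can be sketched as below. This is not the software used for Table 15: extracting the loadings from the principal components of the correlation matrix is an assumption, and the saliency values are synthetic stand-ins for the results of 30 networks.

```python
import numpy as np

def principal_loadings(data, n_factors=2):
    """Unrotated loadings from the correlation matrix (principal component method)."""
    corr = np.corrcoef(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(corr)
    top = np.argsort(eigvals)[::-1][:n_factors]
    return eigvecs[:, top] * np.sqrt(eigvals[top])

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a (variables x factors) loading matrix."""
    L = np.asarray(loadings, dtype=float)
    n, k = L.shape
    R = np.eye(k)
    criterion = 0.0
    for _ in range(max_iter):
        rotated = L @ R
        target = rotated ** 3 - rotated @ np.diag((rotated ** 2).sum(axis=0)) / n
        U, s, Vt = np.linalg.svd(L.T @ target)
        R = U @ Vt
        if criterion * (1 + tol) > s.sum():   # stop when the criterion stalls
            break
        criterion = s.sum()
    return L @ R, R

rng = np.random.default_rng(0)
saliencies = rng.normal(size=(30, 6))        # 30 networks x 6 metrics (synthetic)
saliencies[:, 1:3] += saliencies[:, [0]]     # induce one correlated group of metrics
unrotated = principal_loadings(saliencies)
rotated, R = varimax(unrotated)
```

Metrics whose rotated loadings are large on the same column would be read as loading together on one factor. The rotation is orthogonal, so each metric's communality is unchanged.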

For each feature, all of the saliency metrics generally load together on the first factor to explain between 70 and 94 percent of the variance in the saliency prior to the varimax rotation. From Table 15, it can be seen that $\Lambda_i^{data}$, $\Gamma_i$, $\rho_i$, and $s_i$ have a strong statistical relationship for this problem, since they consistently load on the same factor for each feature. There is also a statistical
relationship  between  and  the  weight-based  saliencies,  T,^  TJ,  and  T|“.  The  metric  AJ^**" 

did  not  consistently  load  with  one  factor  or  the  other;  however,  seven  out  of  nine  times  it  loaded 
with  Af***. 
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It  is  interesting  to  note  that  factor  analysis  results  for  all  the  features  indicate  a  statistical 
relationship  between  pu  it,  and  the  metrics  Fj  and  for  this  problem,  since  they  correspond 
to  markedly  different  saliency  rankings  and  ‘saliency  function’  ratios.  This  statistical  relationship 
is,  most  likely,  due  to  the  fact  that  all  of  these  metrics  are  evaluated  only  at  the  known  data 
points.  This  may  indicate  an  underlying  ‘sampling  factor.’  It  is  also  true  that  the  terms  in  Pi  are 
weighted  averages  of  the  terms  in  where  the  weighting  term  for  the  pth  exemplar  is  cf .  Other 

similarities  between  pi  and  Sj  are  discussed  in  Appendix  A. 

3.5.4  Summary  The  relationships  between  the  available  neural  network  feature  saliency 
metrics  are  documented  in  this  section.  The  definitions  and  relationships  presented  in  Tables  9 
and  10  consolidate  what  is  known  about  the  available  feature  saliency  metrics. 

A ‘real world’ problem is analyzed in this section to evaluate the set of available feature saliency metrics, including those introduced in Sections 3.2 and 3.4 of this chapter. The derivative-based metrics discussed in Section 3.2 all have similar rankings and ‘saliency function’ ratios. The weight-based metrics all have similar rankings and ‘saliency function’ ratios. The weight-based metrics had rankings similar to the derivative-based metrics, but the ‘saliency function’ ratios between important and unimportant features are smaller for the weight-based saliency metrics.

The metrics $\rho_i$ and $s_i$ are statistically related to each other and the derivative-based metrics by an underlying ‘sampling factor,’ since all of these metrics are evaluated using the known data. However, their saliency metric rankings are markedly different from the other metrics. Also, their ‘saliency function’ ratios for important to unimportant features are not similar to each other or to the other metrics. This indicates that the results from $\rho_i$ and $s_i$ are fundamentally different from the other saliency metrics despite the underlying ‘sampling factor.’

One final observation is made from the feature-by-feature elimination using the saliency metric rankings. All of the metrics, despite fundamental differences, had a set of ‘worst’ ranked features which did not drastically affect the classification accuracy when removed from the feature set (see Figure 10).

3.6  Summary 

There are four important research results presented in this chapter. These results are summarized in the next four paragraphs, followed by a discussion of which metrics are recommended.

In Section 2, a framework is developed and used for analyzing a variety of derivative-based metrics. Sampling modifications for an established metric are evaluated in this framework. Using ‘saliency function’ ratios, the saliency metric sensitivities to sampling, training, and redundant middle nodes are evaluated. The metrics do not appear to be particularly sensitive to sampling; however, they are sensitive to redundant middle nodes and the amount of training. It is most important to eliminate redundant middle nodes, since the metrics are most sensitive to training effects in the presence of redundant middle nodes. Increased training may cause the weights associated with the redundant nodes to grow disproportionately. This can contaminate saliency results, since the weights from irrelevant features can get large.

In Section 3, a theoretical relationship is shown between derivative-based and weight-based saliency. In summary, the derivative-based feature saliency metrics are bounded above by a constant linear combination of the feature weights. At one middle node, the ‘saliency function’ ratios produced with the weight-based metrics and derivative-based metrics are equal, to within roundoff error. This occurs because the constant term directly cancels for the ‘saliency function’ ratio between any two features. When additional middle nodes are used, empirical results indicate that the relative saliency of important to unimportant features is smaller with weight-based saliency than it is for derivative-based saliency. For problems with redundant middle nodes, this is partly due to the growth of the irrelevant weights associated with redundant middle nodes.




In Section 4, contributions are made in the area of Bayesian-based feature saliency. First, a succinct and exact relationship is demonstrated between a previously suggested Bayesian-based metric and derivative-based saliency. Then a novel Bayesian-based saliency metric using the partial derivative of classifier error is introduced. The computation of this metric requires only a subset of the terms associated with the previously suggested Bayesian-based metric. Finally, the relationship between the new Bayesian-based saliency and derivative-based saliency is derived, and an upper bound for the Bayesian-based saliency is defined. For a two class problem, the new metric produces results exactly equivalent to derivative-based saliency.

In Section 5, a catalogue of feature saliency metric definitions and relationships is presented in Tables 9 and 10. In this section, the catalogue of metrics is evaluated for a ‘real world’ problem. On this problem, saliency rankings, ‘saliency function’ ratios, and factor analysis are used to empirically evaluate similarities and differences between the saliency metrics. One similarity is that, despite differences, all of the metrics consistently ranked a set of ‘nonessential’ features last.

Since  metrics  perform  differently,  the  remainder  of  this  section  summarizes  recommendations 
for  selecting  a  feature  saliency  metric. 

For discriminant analysis problems using networks with more than one middle node, the saliency metric $\Gamma_i$ in Equation 53 on page 86 is preferable to $\Lambda_i^{data}$ in Equation 31 on page 68 for two reasons. First, it is intuitively appealing, since it represents a saliency metric which is related to the average classifier error $\hat{P}_e$. Second, it provides feature rankings using a subset of the terms required for computation of $\Lambda_i^{data}$.

For function approximation or discriminant analysis problems using networks with more than
one middle node, the saliency metric $\Lambda_i^{data}$ in Equation 31 on page 68 should be preferred over
$\Lambda_i$. The saliency metric $\Lambda_i^{data}$ provides good feature rankings, requires less computation than $\Lambda_i$,
and is based on information known about the data from feature space. Furthermore, the metric $\Lambda_i$
requires a tactical decision concerning the amount of ‘pseudo-sampling.’

For any classification or function approximation problem using a network with only one middle
node, the weight-based metrics (see Equation 32 on page 78) are best. In this case, the relative
saliencies  produced  using  weight-based  saliency  will  be  identical  to  Fj  or  For  networks  using 
more  than  one  middle  node,  weight-based  saliency  can  still  be  used  for  a  cursory  analysis  of  a 
feature’s  relative  importance.  However,  the  empirical  results  suggest  that  the  relative  importance 
of  one  feature  to  another  is  degraded  when  additional  middle  nodes  are  used.  Despite  a  potential 
degradation  in  results,  weight-based  saliency  is  still  appealing.  It  is  related  to  the  metric  and 
it  is  directly  computable  from  a  trained  network  without  reevaluating  the  data. 

The metrics Sj and pi are not recommended, because they require unnecessary work or
additional assumptions with no gains in performance over the other available saliency metrics. The
experimental pseudo-sampling metric is not preferred, since it invokes unnecessary random
sampling and potentially greater computation. The second order metric, s^, is not recommended,
since it is associated with additional assumptions which are used to simplify the evaluation of the
Taylor’s series expansion of the network error. The relevance metric, is not recommended, since
it requires the assumption that pi can be approximated using a partial derivative of the error with
respect to a relevance factor.

The  results  shown  in  Section  3.5  of  this  chapter  indicate  that  the  least  important  features,  as 
ranked  by  any  of  the  feature  saliency  metrics,  may  not  be  essential  to  good  classification  accuracy. 
In  the  next  chapter,  feature  screening  techniques  are  presented  which  can  be  used  to  formally 
identify unimportant or noise-like features.

IV. Feature Screening Techniques for Feedforward Neural Networks


4.1  Introduction 

Feature screening can be used to take a preliminary look at a set of features. Specifically, it
can be used to identify and eliminate noisy features prior to a formalized feature selection procedure.
Two feature screening procedures are discussed and evaluated in this section. The first procedure,
presented in Section 2, is based on statistically comparing the saliency of candidate features to
the saliency of a noisy feature. The second procedure, presented in Section 3, is based on the
weight screening hypothesis test proposed by White [82]. These procedures are summarized and
recommendations are made in Section 4.

As in Chapter III, the research and theoretical results presented in this chapter reflect the
exclusive use of the sigmoidal activation functions on the middle and output nodes of a feedforward
neural network (presented in Chapter I). The fundamental definitions of network output activations,
saliency, and network derivatives will change for other types of activation functions. However, the
underlying procedures presented in this chapter remain the same. The neural network notational
conventions, network structure, and the backpropagation algorithm introduced in Section 3 of
Chapter I are used as necessary in this chapter.

4.2 Feature Saliency Screening

4.2.1 Introduction. A saliency screening procedure which statistically formalizes the screening
methodology proposed by Belue and Bauer is presented in this section [5]. In their procedure,
Belue and Bauer augment a set of features with a known irrelevant feature (i.e., noise). They
recommend eliminating any feature whose mean saliency falls inside a one-sided confidence interval
about the mean saliency of noise. Mean saliency is calculated from the feature saliency results of
several trained neural networks.

In the formalized procedure presented here, the entire feature set is simultaneously screened
for nonsalient features using a specified degree of statistical confidence. The formalized procedure
requires an injected noise feature, a Bonferroni test statistic, and an appropriate hypothesis for
testing the equality of means. These are reviewed before the formalized screening procedure is
presented. The saliency screening procedure is evaluated on the XOR problem and on two application
problems to determine if artificially included noise features can be identified.

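The saliency metric $\Lambda_i^{data}$ is used in this discussion, although a different saliency metric could
easily have been used. For reference, $\Lambda_i^{data}$, defined in Equation 31 of Chapter III, is given again as:

\[
\Lambda_i^{data} = \frac{1}{P} \sum_{p=1}^{P} \sum_{k=1}^{K} \left| \frac{\partial z_k}{\partial x_i}(\mathbf{x}^p, \mathbf{w}) \right|
\]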
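As an illustration only, the following Python sketch evaluates this sum for a network with a single layer of sigmoidal middle nodes and sigmoidal output nodes. The bias terms, the weight layout (`W1`, `b1`, `W2`, `b2`), and the function names are assumptions made for the example; they are not the notation of Chapter I.

```python
import math


def sigmoid(a):
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-a))


def forward(x, W1, b1, W2, b2):
    """Return (middle node activations, output activations) for one exemplar x.

    W1[j][i] is the weight from input i to middle node j (assumed layout).
    W2[k][j] is the weight from middle node j to output k (assumed layout).
    """
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    z = [sigmoid(sum(w * hj for w, hj in zip(row, h)) + b) for row, b in zip(W2, b2)]
    return h, z


def saliency_data(X, W1, b1, W2, b2):
    """Average over the exemplars in X of sum_k |dz_k/dx_i|, one value per feature i."""
    n_features = len(X[0])
    totals = [0.0] * n_features
    for x in X:
        h, z = forward(x, W1, b1, W2, b2)
        for i in range(n_features):
            for k, zk in enumerate(z):
                # Chain rule through the sigmoidal output node and middle nodes.
                dz = zk * (1.0 - zk) * sum(
                    W2[k][j] * hj * (1.0 - hj) * W1[j][i] for j, hj in enumerate(h)
                )
                totals[i] += abs(dz)
    return [t / len(X) for t in totals]
```

A feature whose input weights are all zero receives a saliency of exactly zero in this sketch, while a trained network will generally assign small nonzero weights, and hence nonzero saliency, to a noise input.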


4.2.2 Noise Injection. There are two reasons for augmenting the feature set with noise.
First, the saliency of noise features is significantly greater than zero; therefore, a test cannot be
developed based on $\Lambda_i^{data} = 0$. Second, the actual magnitude of this saliency can be different from
problem to problem; therefore, the saliency magnitude of noise-like features should be characterized for the
problem at hand. This entails:


• Augmenting the original feature set with a true noise feature, $x_n$. One way to generate a
noise feature is by drawing random numbers from a Uniform (0,1) distribution.

• Training a neural network using the augmented feature set to minimize the training-test set
error.

• Calculating the feature saliency for all the features (including the noise).
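The augmentation step can be sketched in a few lines of Python. The function name `augment_with_noise` and the list-of-lists data layout are assumptions made for this example; the only element taken from the text is the Uniform (0,1) noise draw.

```python
import random


def augment_with_noise(X, rng=None):
    """Return a copy of X with one Uniform(0,1) noise feature appended to each exemplar.

    X is a list of feature vectors (lists of floats). The original rows are not modified.
    """
    rng = rng or random.Random()
    return [list(row) + [rng.random()] for row in X]


if __name__ == "__main__":
    features = [[0.2, 0.9], [0.7, 0.1]]
    print(augment_with_noise(features, random.Random(1)))
```

Passing a seeded `random.Random` instance makes the injected noise reproducible across the repeated training runs.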

4.2.3 Bonferroni Joint Hypothesis Testing. The Bonferroni procedure accounts for the fact
that the significance level for a ‘family’ of tests is not the same as that for an individual test.
To conduct M one-sided hypothesis tests with a ‘family’ significance level of $\alpha$, each individual
test must be conducted with an individual significance level of $\alpha/M$ [47:594]. The M hypothesis tests
correspond to the M candidate features in the feature set. The individual significance level is found
using the Bonferroni inequality, which is derived from simple probability theorems.

Let $A_i$ be the event that the ith null hypothesis is rejected when it is true. The probability
of $A_i$ is denoted as $P(A_i)$, and the confidence coefficient for $A_i$ is defined as $1 - P(A_i)$. The following
set of probability theorems is used to derive the Bonferroni inequality.

Probability Theorem 1

\[ P(A_i \cup A_j) = P(A_i) + P(A_j) - P(A_i \cap A_j) \]

Probability Theorem 2

\[ P(\bar{A}_i) = 1 - P(A_i) \]

Probability Theorem 3

\[ P(\overline{A_i \cup A_j}) = P(\bar{A}_i \cap \bar{A}_j) \]

Using these basic probability theorems, the following statement can be derived.

\[ P(\bar{A}_i \cap \bar{A}_j) = 1 - P(A_i) - P(A_j) + P(A_i \cap A_j) \]

Since $P(A_i \cap A_j)$ is greater than or equal to zero, the Bonferroni inequality is defined as [47:160]:

\[ P(\bar{A}_i \cap \bar{A}_j) \geq 1 - P(A_i) - P(A_j) \]

The right hand side of the Bonferroni inequality represents a conservative estimate of the joint
confidence coefficient for the individual hypothesis tests $A_i$ and $A_j$.

The Bonferroni inequality can be generalized for simultaneous hypothesis testing of M individual
hypothesis tests denoted $A_1, A_2, \ldots, A_M$ as
\[ P(\bar{A}_1 \cap \bar{A}_2 \cap \cdots \cap \bar{A}_M) \geq 1 - \left( P(A_1) + P(A_2) + \cdots + P(A_M) \right) \]

Table 16. Bonferroni Critical Values

                          Individual Significance Levels
Degrees of Freedom ν      α/M = 0.01      α/M = 0.005      α/M = 0.0005
10                        2.764           3.169            4.587
20                        2.528           2.845            3.850
30                        2.457           2.750            3.646
40                        2.423           2.704            3.551
60                        2.390           2.660            3.460
120                       2.358           2.617            3.373
∞                         2.326           2.576            3.291

α: ‘family’ level of significance, where α = M · (α/M)
M is the number of individual hypothesis tests

Now let $P(A_i)$ take the same value for each of the M individual hypothesis tests. The result is that the joint
hypothesis test has a confidence coefficient of at least $1 - M \cdot P(A_i)$. For the joint test to have a confidence
coefficient of $1 - \alpha$ (or a ‘family’ significance level of $\alpha$), the individual hypothesis tests must have
significance levels equal to $\alpha/M$.

To produce a ‘family’ significance level of $\alpha$, the Bonferroni critical value, denoted B, is used
for each individual hypothesis test. For M one-sided tests, the Bonferroni critical value is:

\[ B = t_{1 - \alpha/M,\, \nu} \tag{57} \]

where $\nu = N - 1$ and N is the number of observations on which the tests are based. Table 16
contains Bonferroni critical values B for combinations of $\nu$ and $\alpha/M$.
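Equation 57 can also be evaluated numerically when a table of critical values is not at hand. The sketch below is one way to do so using only the Python standard library: it integrates the Student t density with Simpson's rule and inverts the resulting distribution function by bisection. The function names, the step count, and the search bracket of [0, 100] are choices made for this example; the bracket assumes the critical value is below 100, which holds for the degrees of freedom considered here but not for very small $\nu$.

```python
import math


def t_pdf(x, nu):
    """Density of Student's t distribution with nu degrees of freedom."""
    log_c = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
             - 0.5 * math.log(nu * math.pi))
    return math.exp(log_c - (nu + 1) / 2 * math.log1p(x * x / nu))


def t_cdf(x, nu, steps=2000):
    """Distribution function via Simpson's rule on [0, |x|], using symmetry about zero."""
    a = abs(x)
    if a == 0.0:
        return 0.5
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, nu) + t_pdf(a, nu)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, nu)
    area = total * h / 3
    return 0.5 + area if x > 0 else 0.5 - area


def bonferroni_critical_value(alpha, M, N):
    """B = t_{1 - alpha/M, N - 1} for M one-sided tests based on N observations."""
    target = 1.0 - alpha / M
    nu = N - 1
    lo, hi = 0.0, 100.0  # assumed bracket for the critical value
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, nu) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


if __name__ == "__main__":
    # Six candidate features, family level 0.03, thirty trained networks.
    print(round(bonferroni_critical_value(0.03, 6, 30), 3))
```

With an individual significance level of 0.005 and N = 30, the sketch should return a value close to the critical value of 2.756 used in the screening tables of this chapter.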

4.2.4 Hypothesis Tests for the Equality of Means. The saliency screening methodology consists
of simultaneously applying individual hypothesis tests on the equality of two means: (1) the
mean saliency of feature i, denoted $\mu_{\Lambda_i^{data}}$, and (2) the mean saliency of noise, denoted $\mu_{\Lambda_n^{data}}$. In
general, this type of hypothesis test requires two conditions:

1. The saliency for each feature i should be a normally distributed random variable, or if the
saliency is nonnormal, the conditions of the central limit theorem (CLT) should hold [24:280].

2. The saliency for the ith feature is independently distributed with mean $\mu_{\Lambda_i^{data}}$ and variance
$\sigma^2_{\Lambda_i^{data}}$. Similarly, the saliency for the augmented noise feature is independently distributed
with mean $\mu_{\Lambda_n^{data}}$ and variance $\sigma^2_{\Lambda_n^{data}}$ [24:281].


For the saliency screening methodology, the first condition must be met by using the CLT,
since the theoretical distribution of $\Lambda_i^{data}$ has not been shown to be normal. The CLT states [47:6]:


Theorem 1 If $Y_1, \ldots, Y_N$ are independent random observations from a population with probability
function $f(Y)$ for which $\sigma^2\{Y\}$ is finite, the sample mean $\bar{Y}$:

\[ \bar{Y} = \frac{\sum_{n=1}^{N} Y_n}{N} \]

is approximately normally distributed when the sample size N is reasonably large, with mean $E\{Y\}$
and variance $\sigma^2\{Y\}/N$.

To invoke the CLT, the saliency metric $\Lambda_i^{data}$ is defined as

\[ \Lambda_i^{data} = \frac{\sum_{p=1}^{P} Y_i^p}{P} \]

where

\[ Y_i^p = \sum_{k=1}^{K} \left| \frac{\partial z_k}{\partial x_i}(\mathbf{x}^p, \mathbf{w}) \right| \]

The superscript p corresponds to the pth exemplar from a set of P exemplars. The variable $Y_i^p$ is
a random observation from a population with probability function $f(Y_i)$ for which $\sigma^2\{Y_i\}$ is finite.
The variable $Y_i^p$ is a random observation since it is a function of $\mathbf{x}^p$, which is a random observation
from the population of input exemplars. The conditions of the CLT can be applied in a similar
fashion for many of the metrics shown in Table 9 in Chapter III.

The  second  condition  for  ‘equality  of  means’  hypothesis  testing  is  also  met.  The  samples  of 
saliency  for  each  feature  are  associated  with  independent  realizations  of  neural  networks.  Each 
realization  is  independent  for  three  reasons:  (1)  the  random  partitioning  of  the  data  set,  (2)  the 
random  initialization  of  the  weight  parameters,  and  (3)  the  random  order  for  presentation  of  the 
training  data. 

The test statistic and the hypothesis test used to test for the equality of two means are defined
by one of four cases [24:280-292]. They are:

1. The variances ($\sigma^2_{\Lambda_i^{data}}$, $\sigma^2_{\Lambda_n^{data}}$) corresponding to the distributions of $\Lambda_i^{data}$ and $\Lambda_n^{data}$ are
known. The observations of $\Lambda_i^{data}$ and $\Lambda_n^{data}$ are independent.

2. The variances ($\sigma^2_{\Lambda_i^{data}} = \sigma^2_{\Lambda_n^{data}}$) corresponding to the distributions of $\Lambda_i^{data}$ and $\Lambda_n^{data}$ are
unknown but equal. The observations of $\Lambda_i^{data}$ and $\Lambda_n^{data}$ are independent.

3. The variances ($\sigma^2_{\Lambda_i^{data}} \neq \sigma^2_{\Lambda_n^{data}}$) corresponding to the distributions of $\Lambda_i^{data}$ and $\Lambda_n^{data}$ are
unknown and unequal. The observations of $\Lambda_i^{data}$ and $\Lambda_n^{data}$ are independent.

4. The observations of $\Lambda_i^{data}$ and $\Lambda_n^{data}$ are paired and dependent.

Cases 1, 2, and 3, which each correspond to a null hypothesis of $H_0: \mu_{\Lambda_i^{data}} = \mu_{\Lambda_n^{data}}$,
are inappropriate, because in this application the observations of $\Lambda_i^{data}$ and $\Lambda_n^{data}$ are paired and
dependent: both saliencies in a pair are computed from the same trained network. The appropriate
hypothesis testing procedure is the Paired t-test defined using Case 4.
The Paired t-test is defined here as:


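Null Hypothesis $H_0: \mu_{D_i} = 0$

Alternative Hypothesis: $\mu_{D_i} > 0$,


where $\mu_{D_i}$, the difference between the ith feature’s mean saliency and the noise feature’s mean
saliency, is defined

\[ \mu_{D_i} = \mu_{\Lambda_i^{data}} - \mu_{\Lambda_n^{data}} \]

Define the test statistic $t^*$ as

\[ t^* = \frac{\bar{D}_i}{S_{\bar{D}_i}} \tag{58} \]

where

\[ \bar{D}_i = \frac{\sum_{j=1}^{N} D_{ij}}{N} \tag{59} \]

\[ S^2_{\bar{D}_i} = \frac{\sum_{j=1}^{N} (D_{ij} - \bar{D}_i)^2}{(N - 1)N} \tag{60} \]

and j indicates the jth of the N samples of $D_{ij}$ (the difference between the saliency of feature i and
the saliency of the noise feature in the jth trained network). For feature i, $\bar{D}_i$ and $S^2_{\bar{D}_i}$ are the sample mean
and sample variance (the estimated variance of $\bar{D}_i$), respectively, for N samples. The critical value for an
individual significance level of $\alpha$ is $t_{crit} = t_{1-\alpha,\nu}$, where $\nu = N - 1$. The null hypothesis is rejected if
the test statistic, $t^*$, exceeds the critical value $t_{crit}$.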
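Equations 58 through 60 translate directly into a short computation. The Python sketch below is illustrative only; the function name `paired_t_statistic` and its argument names are assumptions, and the two input lists are taken to hold the saliency of feature i and of the noise feature from the same N trained networks, in the same order.

```python
import math


def paired_t_statistic(feature_saliency, noise_saliency):
    """Return (D_bar, S_D_bar, t_star) following Equations 58 to 60.

    feature_saliency[j] and noise_saliency[j] must come from the same trained network j.
    """
    n = len(feature_saliency)
    if n != len(noise_saliency) or n < 2:
        raise ValueError("need at least two paired observations")
    d = [f - z for f, z in zip(feature_saliency, noise_saliency)]
    d_bar = sum(d) / n                                          # Equation 59
    s2 = sum((dj - d_bar) ** 2 for dj in d) / ((n - 1) * n)     # Equation 60
    s_d_bar = math.sqrt(s2)
    if s_d_bar == 0.0:
        raise ValueError("all paired differences are identical; t* is undefined")
    return d_bar, s_d_bar, d_bar / s_d_bar                      # Equation 58


if __name__ == "__main__":
    print(paired_t_statistic([2.0, 2.2, 1.8, 2.1], [0.1, 0.2, 0.1, 0.2]))
```

The returned statistic is then compared with the one-sided critical value for $\nu = N - 1$ degrees of freedom.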


4.2.5 Methodology. The formalized statistical screening procedure for identifying nonsalient
features is based on applying the Bonferroni inequality to M individual hypothesis tests to achieve
a predetermined ‘family’ significance level $\alpha$. The individual hypothesis test used is

Null Hypothesis $H_0: \mu_{D_i} = 0$
Alternative Hypothesis: $\mu_{D_i} > 0$

The procedure is summarized as:

Saliency Screening Procedure

1. Augment the feature set with a noise feature, $x_n$.

2. Train a neural network to minimize the training-test set error.

All nets should ideally use a minimal network structure with no redundant middle nodes or
features.

All nets should ideally converge to a local minimum and not a saddle point.

3. Compute the feature saliency, $\Lambda_i^{data}$, for each of the features, including $x_n$.

4. Repeat steps 2 and 3 a minimum of ten times (N = 10), using random initialization of weight
parameters and random data set partitioning.

5. Select a ‘family’ significance level, $\alpha$.

6. For each feature do an individual hypothesis test as follows:

(a) Compute $\bar{D}_i$ and $S_{\bar{D}_i}$ using Equations 59 and 60 on page 113.

(b) Compute the test statistic $t^*$ using Equation 58 on page 113.

(c) Determine the Bonferroni critical value $B = t_{1-\alpha/M,\nu}$.

(d) Evaluate the test statistic as follows:

• If $t^* \leq B$, the null hypothesis cannot be rejected for feature i.

Conclusion: feature i is nonsalient, since the difference between the ith feature’s
saliency and the noise feature’s saliency is not statistically different from zero at the
$\alpha$ ‘family’ significance level.

• If $t^* > B$, reject the null hypothesis for feature i.

Conclusion: feature i is salient, since there is a statistical difference at the $\alpha$ ‘family’
significance level between the saliency of the ith feature and the saliency of the noise
feature.

7. Eliminate the nonsalient features and retrain the network with only the salient features.

If there are any questions concerning the salient features, the procedure can be performed
again on the reduced feature set to confirm the remaining features are all still salient.
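Step 6 of the procedure can be sketched as follows, assuming the saliencies from steps 2 through 4 have already been collected. The function name `screen_features`, the row-per-network data layout, and the choice to pass the Bonferroni critical value B in as an argument (for example, read from Table 16) are assumptions made for this illustration.

```python
import math


def screen_features(saliency_runs, noise_index, B):
    """Apply the paired test of step 6 to every candidate feature.

    saliency_runs: list of N rows, one per trained network; each row holds the
                   saliency of every feature, including the injected noise feature.
    noise_index:   column position of the injected noise feature.
    B:             Bonferroni critical value for the chosen family significance level.

    Returns a dict mapping feature index to (t_star, is_salient).
    """
    n_runs = len(saliency_runs)
    n_features = len(saliency_runs[0])
    results = {}
    for i in range(n_features):
        if i == noise_index:
            continue  # the noise feature is the reference, not a candidate
        d = [run[i] - run[noise_index] for run in saliency_runs]
        d_bar = sum(d) / n_runs
        s2 = sum((dj - d_bar) ** 2 for dj in d) / ((n_runs - 1) * n_runs)
        t_star = d_bar / math.sqrt(s2)
        # Reject H0 (declare the feature salient) only when t* exceeds B.
        results[i] = (t_star, t_star > B)
    return results


if __name__ == "__main__":
    runs = [[2.0, 0.12, 0.10], [2.2, 0.18, 0.20], [1.8, 0.09, 0.10], [2.1, 0.21, 0.20]]
    print(screen_features(runs, noise_index=2, B=5.841))
```

Features flagged as not salient would then be removed in step 7 before the network is retrained. The sketch does not guard against a feature whose paired differences are all identical, in which case the standard error is zero.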

The identification of nonsalient features using saliency screening is related to the sample size.
A larger sample size usually corresponds to a smaller standard error $S_{\bar{D}_i}$, and a smaller standard
error corresponds to a larger test statistic $t^*$ in Equation 58. Additionally, a larger sample
means a smaller critical t-value and hence a smaller Bonferroni critical value B. For a feature which is
borderline or very nearly noise-like, $H_0$ may be more easily rejected with larger sample sizes.

The saliency metrics used in this procedure are sensitive to training and middle nodes as
discussed in Chapter III; therefore, this procedure may also be sensitive to training and to middle
node redundancies. These factors are partially mitigated for evaluation of the saliency screening
procedure, since a minimal number of middle nodes and sufficient training epochs are determined
through a series of pilot runs.

4.2.6 Analysis. The robustness of the saliency screening methodology is evaluated with
three problems. These problems include: the XOR problem, the Armor Piercing Incendiary (API)
Projectile Functioning problem, and the FLIR problem. The second and third problems are used
to evaluate whether the saliency screening procedure is practical for application problems. For
each test problem, the feature set is augmented with one or more additional noise features. Then
the saliency screening procedure is used to determine if additional noise features are identified as
nonsalient.


4.2.6.1 XOR Problem. In this section, the saliency screening procedure is applied to
an XOR problem (shown in Figure 9 of Chapter III) which has been augmented with five additional
noise features. One of the artificial noise features serves as the augmented noise $x_n$, and the others
serve as candidate features. The procedure is evaluated to see if the candidate noise features are
identified during the screening process.

The saliency screening procedure was repeated with N equal to 10, 30, and 100 realizations of
‘trained’ neural networks which were all trained with the same data set, but with different random
initial weights and different training order presentations. A data set of 600 vectors was partitioned
into training, training-test, and validation sets of size 400, 100, and 100, respectively. The networks
were trained with four middle nodes, which was the minimal network which would reliably train
to a local minimum. A log-linear declining learning rate and a momentum rate of 0.30 were used.
For all runs, 700 epochs were used with a requirement for a seven percent or lower training set
classification error for the network’s solution to be considered from a ‘trained’ network.
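For readers who wish to reproduce a partition of this shape, a minimal Python sketch is given below. It assumes a simple random shuffle followed by slicing; the actual database partitioning procedure is the one described in Chapter I, and the function name `partition` is hypothetical.

```python
import random


def partition(data, sizes, rng=None):
    """Randomly split data into consecutive disjoint subsets of the given sizes."""
    if sum(sizes) > len(data):
        raise ValueError("subset sizes exceed the number of exemplars")
    rng = rng or random.Random()
    shuffled = list(data)
    rng.shuffle(shuffled)
    parts, start = [], 0
    for size in sizes:
        parts.append(shuffled[start:start + size])
        start += size
    return parts


if __name__ == "__main__":
    vectors = list(range(600))  # stand-ins for the 600 exemplars
    train, train_test, validation = partition(vectors, (400, 100, 100), random.Random(0))
    print(len(train), len(train_test), len(validation))
```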

Saliency screening results for N = 30 realizations are reported in Table 17. In total, 38
networks were trained, and 30 networks met the minimum error requirement. The average training
and training-test set error rates over the 30 networks were 2.69% and 6.83%, respectively.
Using the screening procedure, the artificial noise features are identified as nonsalient, and the
features x and y are identified as salient. Equivalent results are produced for the saliency screening
procedure with N = 10 and N = 100 neural network realizations. The other saliency metrics (see
Table 9 of Chapter III) also provide similar results when they are used with the saliency screening
procedure.

The feature saliency rankings from Table 17 are used for successive removal of candidate
features. In Figure 11 the mean error and a 95% confidence interval band (plotted as horizontal
bars about the means) are plotted on a validation set of 100 vectors. The mean error at each point
is labeled with the features retained in the neural network model. The database partitioning and
confidence interval procedures discussed in Chapter I are used to estimate the average network
classification errors using the best 30 of 100 realizations of neural network training. The network’s
accuracy generally improves as the nonsalient candidate features are removed from the training
set one-by-one. The 95% confidence interval for the average error is significantly smaller when all
the noise has been eliminated from the feature set. Another way of saying this is that prediction
variance is reduced after noise-like features are eliminated. The network accuracy is only seriously
degraded when the first salient feature x is removed.

Table 17. Saliency Screening Results on XOR Problem for 30 Runs

Feature i   Ranking   $\bar{\Lambda}_i^{data}$   $\sigma_{\bar{\Lambda}_i^{data}}$   $\bar{D}_i$   $S_{\bar{D}_i}$   $t^* = \bar{D}_i / S_{\bar{D}_i}$   $\nu$   Reject $H_0$
x           2         2.055                      0.610E-01                           1.914         0.611E-01         31.3                                29      yes
y           1         2.149                      0.476E-01                           [illegible]   0.456E-01         44.1                                29      yes
noise 1     6         0.133                      0.148E-01                           [illegible]   0.148E-01         [illegible]                         29      no
noise 2     7         0.128                      0.111E-01                           [illegible]   [illegible]       -1.570                              29      no
noise 3     4         0.142                      0.145E-01                           -0.893E-03    0.142E-01         0.630E-01                           29      no
noise 4     3         0.151                      0.148E-02                           -0.104E-01    0.179E-01         0.582                               29      no
$x_n$       5         0.141                      0.122E-01                           N/A           N/A               N/A                                 N/A     N/A

Features ranked from best to worst
Individual Significance Level α/M = .005
Bonferroni Critical Value B = 2.756

Figure 11. Average Validation Set Error on XOR Problem as Features are Eliminated

Figure 12. API Projectile Firing

reprinted from Belue [4]

4.2.6.2 Armor Piercing Incendiary (API) Problem. In this section, the saliency screening
procedure is applied to the API problem, which uses data on the performance of API projectiles
to classify an API projectile’s performance as “complete” or “other” [34]. For each shot, four
independent parameters are known: impact striking velocity (Vs), impact striking mass (Ms), panel
ply thickness in inches (PLY), and the secant of the impact obliquity angle (SECT). The firing
process is illustrated in Figure 6. A feature vector for this application consists of the parameters
of each API projectile firing. The API problem has been augmented with two additional noise
features. One of the artificial noise features serves as the augmented noise $x_n$, and the other serves
as a candidate feature. The saliency screening procedure is evaluated to see if the candidate noise
feature is identified during the screening process.

The saliency screening procedure was repeated for this application with N equal to 10, 30,
and 100 realizations of ‘trained’ neural networks which were all trained with the same data set,
but with different random initial weights and different training order presentations. A data set of
281 vectors was partitioned into training, training-test, and validation sets of size 181, 50, and 50,
respectively. The networks were trained with five middle nodes, which was the minimal network
which would reliably train to a local minimum. A log-linear declining learning rate and a momentum
rate of 0.30 were used. For all runs, 700 epochs were used with a requirement for a seven percent
or lower training set classification error for the network’s solution to be considered from a ‘trained’
network.

Saliency screening results for N = 30 realizations are reported in Table 18. In total, 30
networks were trained, and all 30 networks met the minimum error requirement. The average
training and training-test set error rates over the 30 networks were 2.32% and 8.4%,
respectively.

Using the screening procedure, the artificial noise feature and the feature Ms are identified
as nonsalient. All other features are identified as salient. These results are consistent with Belue
and Bauer [7]. Using the saliency screening procedure with N = 10 and N = 100 neural network
realizations produced results equivalent to those shown in Table 18. Using the saliency screening
procedure with the other metrics presented in Table 9 of Chapter III also produced equivalent
results.

The feature saliency rankings from Table 18 are used for successive removal of candidate
features. In Figure 13 the mean validation error (50 vectors) and a 95% confidence interval band
(plotted as horizontal bars about the means) are plotted. The mean error at each point is labeled
with the features retained in the neural network model. The database partitioning and confidence
interval procedures discussed in Chapter I are used to estimate the average network classification
errors using the best 30 of 100 realizations of neural network training. The 95% confidence interval
for the average error is significantly smaller when the noise feature and Ms are eliminated from
the feature set. Once again, as with the XOR problem, the prediction variance is reduced after
noise-like features are eliminated. It is only after the next feature Vs is removed that the average
network accuracy becomes seriously degraded.

Table 18. Saliency Screening Results on API Problem for 30 Runs

Feature i   Ranking   $\bar{\Lambda}_i^{data}$   $\sigma_{\bar{\Lambda}_i^{data}}$   $\bar{D}_i$   $S_{\bar{D}_i}$   $t^* = \bar{D}_i / S_{\bar{D}_i}$   $\nu$   Reject $H_0$
PLY         2         0.583                      0.290E-01                           0.432         0.310E-01         13.9                                29      yes
Vs          3         0.472                      0.147E-01                           0.320         0.217E-01         14.8                                29      yes
Ms          5         0.150                      0.119E-01                           -0.939E-03    0.179E-01         -0.525E-01                          29      no
SECT        1         0.960                      0.369E-01                           0.809         0.408E-01         19.8                                29      yes
noise       6         0.148                      0.926E-02                           -0.335E-02    0.164E-01         -0.205                              29      no
$x_n$       4         0.151                      0.144E-01                           N/A           N/A               N/A                                 N/A     N/A

Features ranked from best to worst
Individual Significance Level α/M = .005
Bonferroni Critical Value B = 2.756

Figure 13. Average Validation Set Error on API Problem as Features are Eliminated (vertical axis: validation set classification error; horizontal axis: number of features eliminated from feature set)

4.2.6.3 FLIR Problem. The FLIR problem used in Chapter III is revisited to evaluate
the saliency screening procedure. The FLIR feature set described in Table 11 is augmented with
two artificial noise features. One of the artificial noise features serves as the augmented noise $x_n$,
and the other serves as a candidate feature. The saliency screening procedure is evaluated to see if
the additional noise feature is identified during the screening process.

The saliency screening procedure was repeated with N equal to 10, 30, and 100 realizations of
‘trained’ neural networks which were all trained with the same data set, but with different random
initial weights and different training order presentations. A data set of 550 vectors was partitioned
into training, training-test, and validation sets of size 350, 100, and 100, respectively. The networks
were trained with four middle nodes, which was the minimal network which would reliably train
to a local minimum. A log-linear declining learning rate and a momentum rate of 0.30 were used.
For all runs, 500 epochs were used with a requirement for a seven percent or lower training set
classification error for the network’s solution to be considered from a ‘trained’ network.

Saliency screening results for N = 30 realizations are reported in Table 19. In total, 30
networks were trained, and all 30 networks met the minimum error requirement. The average
training and training-test set error rates over the 30 networks were 3.86% and 8.77%,
respectively.

Using the screening procedure for N = 30, the artificial noise feature is the only feature
identified as nonsalient, and all other features are identified as salient. The third feature, ‘maximum
brightness’, has a fairly small test statistic of 5.37, which is only half the size of the next largest
test statistic, but it would not be identified as nonsalient with the saliency screening procedure.
However, it might warrant further screening after the noise feature is removed.

Using the saliency screening procedure with N = 100 neural network realizations produced
equivalent results. However, when the saliency screening procedure is used with N = 10 neural
network realizations, the test statistic for ‘maximum brightness’ drops to 3.02. With an individual
significance level α/M = .005 it still would not be identified as nonsalient, but at a more stringent
significance level, say α/M = .0005, it would be identified. This result demonstrates the relationship between
sample size and the identification of nonsalient features using the saliency screening procedure.

Table 19. Saliency Screening Results on FLIR Problem for 30 Runs

Feature i   Ranking   $\bar{\Lambda}_i^{data}$   $\sigma_{\bar{\Lambda}_i^{data}}$   $\bar{D}_i$   $S_{\bar{D}_i}$   $t^* = \bar{D}_i / S_{\bar{D}_i}$   $\nu$   Reject $H_0$
1           7         0.629                      0.374E-01                           0.537         0.367E-01         14.6                                29      yes
2           1         1.537                      0.564E-01                           1.444         0.572E-01         25.2                                29      yes
3           9         0.256                      0.303E-01                           0.163         0.304E-01         5.37                                29      yes
4           8         0.481                      0.169E-01                           0.389         0.163E-01         23.9                                29      yes
5           2         1.447                      0.506E-01                           1.355         0.499E-01         27.1                                29      yes
6           6         0.853                      0.475E-01                           0.761         [illegible]       15.6                                29      yes
7           4         0.946                      0.392E-01                           0.853         0.408E-01         20.9                                29      yes
8           3         1.208                      0.941E-01                           1.115         0.939E-01         11.9                                29      yes
9           5         0.857                      0.427E-01                           0.764         0.441E-01         17.3                                29      yes
noise       10        0.115                      0.106E-01                           0.229E-01     0.141E-01         1.62                                29      no
$x_n$       11        0.925E-01                  0.763E-02                           N/A           N/A               N/A                                 N/A     N/A

Features ranked from best to worst
Individual Significance Level α/M = .005
Bonferroni Critical Value B = 2.756

When the saliency screening procedure was performed using the other metrics presented in
Table 9 of Chapter III, the results were similar. The differences are that feature 3 is identified as
nonsalient  for  N  =  10  using  the  metrics  pi  and  »i,  and  feature  8  is  identified  as  nonsalient  with  the 
metric  Si.  The  conclusion  reached  is  that  the  results  may  differ  slightly  depending  on  the  saliency 
metric  used. 


The feature saliency rankings from Table 19 are used for successive removal of candidate
features. In Figure 14 the mean error and a 95% confidence interval band (shown as horizontal
bars about the means) are plotted on a validation set of 100 vectors. The mean error at each point
is labeled with the features retained in the neural network model. The database partitioning and
confidence interval procedures discussed in Chapter I are used to estimate the average network
classification errors using the best 30 of 100 realizations of neural network training. The average
network accuracy does not significantly change by removal of the three least salient features, including
noise. In fact, the accuracy is not significantly degraded until the sixth feature is removed from
the model. The explanation for this is two-fold. First, only the least salient features are removed
and second, the features are intercorrelated, so a certain degree of natural redundancy may exist
in this feature set.

Figure 14. Average Validation Set Error on FLIR Problem as Features are Eliminated

In Figure 15, an ordered plot of the features versus the test statistics from the saliency
screening procedure is shown. The relationship between ‘essential’ and ‘nonessential’ features can
be seen visually. Belue and Bauer use a similar plot with features versus feature saliencies as a
visual ‘scree test’ to determine which features to retain [5]. For the FLIR problem, there seem
to be three visual categories of features based on the test statistics. Based on Figure 15,
feature 3 and noise might be categorized as ‘nonessential’ and removed, and features 5, 2, 4,
and 7 might be categorized as ‘essential’ and retained. The third category of features might be
categorized as ‘important’ but warrant further analysis in the context of an overall feature selection
process.

Figure 15. Scree Plot of Saliency Screening Test Statistics for FLIR Features
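The ordering behind such a scree plot can be reproduced from the test statistics in Table 19. The sketch below is only an illustration: `scree_order` and `text_scree_plot` are hypothetical helper names, and the text rendering is a stand-in for the plotted figure. Where to draw the category boundaries remains a visual judgment and is not computed here.

```python
# Paired-t test statistics for the FLIR candidate features, copied from Table 19.
TEST_STATISTICS = {
    "1": 14.6, "2": 25.2, "3": 5.37, "4": 23.9, "5": 27.1,
    "6": 15.6, "7": 20.9, "8": 11.9, "9": 17.3, "noise": 1.62,
}


def scree_order(stats):
    """Return (feature, statistic) pairs sorted from largest to smallest statistic."""
    return sorted(stats.items(), key=lambda item: item[1], reverse=True)


def text_scree_plot(stats, width=40):
    """Render one text bar per feature, scaled to the largest statistic."""
    ordered = scree_order(stats)
    largest = ordered[0][1]
    lines = []
    for name, value in ordered:
        bar = "#" * max(1, round(width * value / largest))
        lines.append(f"{name:>5} {value:6.2f} {bar}")
    return "\n".join(lines)


ordered_features = [name for name, _ in scree_order(TEST_STATISTICS)]
print(text_scree_plot(TEST_STATISTICS))
```

In this ordering, features 5, 2, 4, and 7 lead the list, while feature 3 and noise fall at the bottom.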

Summary.  The  saliency  screening  procedure  is  a  statistical  procedure  for  identifying 
nonsalient  features  in  a  feature  set  using  a  specified  level  of  statistical  confidence.  A  series  of 
paired-t  tests  are  used  to  test  for  a  statistical  difference  between  the  saliency  of  a  true  feature  and 
the  saliency  of  noise.  A  feature’s  paired-t  test  statistic  is  a  comprehensive  measure  of  a  feature’s 
saliency  for  two  reasons.  First,  it  incorporates  the  relative  saliency  of  each  feature  compared  to  a 
known  nonsalient  feature,  and  second,  it  also  incorporates  the  variance  of  feature  saliency. 
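A minimal sketch of the paired-t computation follows. It assumes a one-sided comparison in which a feature is called salient when its statistic exceeds a critical value supplied by the user (2.756 is the Bonferroni critical value reported with Table 19). The function names and the five-run saliency values are hypothetical; only the summary values marked as coming from Table 19 are taken from the text.

```python
import math
import statistics


def paired_t_statistic(feature_saliency, noise_saliency):
    """Paired-t statistic for the saliency differences (feature minus noise)."""
    if len(feature_saliency) != len(noise_saliency):
        raise ValueError("need one saliency pair per network realization")
    diffs = [f - n for f, n in zip(feature_saliency, noise_saliency)]
    mean_diff = statistics.mean(diffs)
    # Standard error of the mean difference: sample std. dev. / sqrt(N).
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return mean_diff / std_err


def t_from_summary(mean_diff, std_err_of_mean_diff):
    """Same statistic, computed from reported summary values."""
    return mean_diff / std_err_of_mean_diff


def is_salient(t_stat, critical_value):
    """A feature is treated as salient when H0 (no difference from noise) is rejected."""
    return t_stat > critical_value


# Hypothetical saliencies from five training runs (illustrative numbers only).
feature_runs = [1.0, 1.2, 0.9, 1.1, 1.3]
noise_runs = [0.1, 0.2, 0.1, 0.1, 0.2]
t_demo = paired_t_statistic(feature_runs, noise_runs)

# Summary values reported in Table 19 for feature 1 and the noise candidate.
BONFERRONI_CRITICAL_VALUE = 2.756
t_feature_1 = t_from_summary(0.537, 0.0367)
t_noise = t_from_summary(0.0229, 0.0141)
print(is_salient(t_feature_1, BONFERRONI_CRITICAL_VALUE))
print(is_salient(t_noise, BONFERRONI_CRITICAL_VALUE))
```

Because the statistic divides the mean difference by its standard error, a feature with a modest mean saliency but very little run-to-run variation can still score highly.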

An empirical evaluation of the saliency screening procedure indicates that conservative results
are common with this procedure. That is, a nonessential feature, having little or no bearing on the
classification accuracy, may be identified as salient if it is statistically different from noise. This
occurs in the FLIR problem, since the third and eighth features are statistically different from noise
in Table 19, but have little impact on classification accuracy as shown in Figure 14. For this reason,
features with relatively low test statistics may warrant further consideration in the context of a
feature selection process.

4.3 Weight Screening

4.3.1 Introduction White describes a weight screening procedure for identifying irrelevant
features  [82].  The  procedure  uses  a  statistical  hypothesis  test  which  is  developed  using  the  limiting 
distribution  of  the  weights  w  [82:442].  The  premise  is  that  the  vector  of  feature  weights  is  not 
statistically  different  from  zero  for  an  irrelevant  feature.  An  overview  of  this  section  follows. 

First,  the  irrelevant  input  hypothesis  test  and  the  associated  distributional  assumptions  are 
reviewed.  Then,  a  relationship  is  shown  between  weight  screening  and  weight-based  saliency.  Next, 
a  weight  screening  methodology  using  the  irrelevant  input  hypothesis  test  is  described.  The  weight 
screening  problem  is  then  evaluated  on  the  same  problems  as  were  previously  used  with  the  saliency 
screening procedure. Finally, the results are summarized along with a comparison of the two
screening  procedures. 

4.3.2 Background In this section, the weight screening hypothesis test for identifying irrelevant
inputs is presented. Related distributional assumptions are stated as required. The weight
screening hypothesis test involves weight parameters from one realization of a trained neural network.
The hypothesis test as given in White [82] follows:

Null Hypothesis $H_0 : S w^* = 0$
Alternative Hypothesis $H_a : S w^* \neq 0$,

where $w^*$ is an $s$-dimensional vector of neural network optimal weight parameters and $S$ is a $q$ by
$s$ selection matrix picking out the $q$ elements of $w^*$ which are hypothesized to be zero under $H_0$.

To statistically test $H_0$ for a neural network, the limiting distribution of the vector of estimated
network weights $\hat{w}$ must be multivariate normal as $\hat{w}$ converges to $w^*$ [82]. The vector of
estimated weights $\hat{w}$ will be multivariate normally distributed ‘in the limit’ if redundant feature
inputs and redundant hidden units are removed [79:441]. ‘In the limit’ refers to weights from a
network trained using an infinitely large data set. When $\hat{w}$ has a multivariate normal limiting
distribution, then

$$\sqrt{P}\,(\hat{w} - w^*) \sim N_s(0, C^*),$$

where $C^*$ is the $s$-dimensional covariance matrix of $\sqrt{P}\,(\hat{w} - w^*)$ (and $\sqrt{P}\,\hat{w}$) and $P$ is the number
of data exemplars [82:441]. Also,

$$\sqrt{P}\,S(\hat{w} - w^*) \sim N_q(0, S C^* S')$$

and

$$P\,(\hat{w} - w^*)' S' (S C^* S')^{-1} S (\hat{w} - w^*) \sim \chi^2_q$$

Under the null hypothesis where $S w^* = 0$, the limiting distributions simplify to

$$\sqrt{P}\,S\hat{w} \sim N_q(0, S C^* S')$$

and

$$P\,\hat{w}' S' (S C^* S')^{-1} S \hat{w} \sim \chi^2_q \tag{61}$$

where the $\chi^2$ test statistic is defined by the left hand side of Equation 61.

The analytical expression for $C^*$ cannot be computed; however, a weakly consistent estimator
$\hat{C}$ is available [79]. The estimator $\hat{C}$ is defined as

$$\hat{C} = \hat{A}^{-1} \hat{B} \hat{A}^{-1} \tag{62}$$

The $s$-dimensional matrices $\hat{A}$ and $\hat{B}$ are

$$\hat{A} = P^{-1} \sum_{p=1}^{P} \nabla^2 \varepsilon^p(x^p, \hat{w}) \tag{63}$$

$$\hat{B} = P^{-1} \sum_{p=1}^{P} \nabla \varepsilon^p(x^p, \hat{w})\, \nabla \varepsilon^p(x^p, \hat{w})' \tag{64}$$

where $\varepsilon^p$ is the neural network error associated with the $p$th exemplar (see Chapter I Equation 1),
and the operators $\nabla$ and $\nabla^2$ denote the $s \times 1$ gradient and the $s \times s$ Hessian operators of $\varepsilon^p$.
Replacing $C^*$ with $\hat{C}$ does not affect the limiting distribution; however, White warns that a
much larger sample size $P$ may be required to obtain a good approximation of $C^*$ [79:442-443].

The test statistic must be estimated using $\hat{C}$. It is defined as

$$\chi^{2*} = P\,\hat{w}' S' (S \hat{C} S')^{-1} S \hat{w} \tag{65}$$

If $\chi^{2*}$ exceeds the $1 - \alpha$ percentile of $\chi^2_q$, the irrelevant input hypothesis is rejected, where the
probability of rejecting $H_0$ when $H_0$ is true is equal to $\alpha$.
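The computation in Equations 62 through 65 can be sketched as follows. This is an illustrative sketch, not the implementation used in this research: it assumes the per-exemplar gradients and Hessians of the network error have already been evaluated at the trained weights and are passed in as plain lists, and it uses a small Gauss-Jordan routine so that the example needs no external libraries. All function names are hypothetical.

```python
def inverse(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular or ill-conditioned")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mat_mul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def sandwich_covariance(gradients, hessians):
    """C_hat = A^-1 B A^-1, with A the mean Hessian and B the mean gradient outer product."""
    n_exemplars = len(gradients)
    s = len(gradients[0])
    a_hat = [[sum(h[i][j] for h in hessians) / n_exemplars for j in range(s)]
             for i in range(s)]
    b_hat = [[sum(g[i] * g[j] for g in gradients) / n_exemplars for j in range(s)]
             for i in range(s)]
    a_inv = inverse(a_hat)
    return mat_mul(mat_mul(a_inv, b_hat), a_inv)


def weight_screening_statistic(w_hat, c_hat, selected, n_exemplars):
    """P * w'S'(S C S')^-1 S w, where S picks out the weight indices in `selected`."""
    sw = [w_hat[i] for i in selected]
    scs = [[c_hat[i][j] for j in selected] for i in selected]
    scs_inv = inverse(scs)
    quad = sum(sw[i] * scs_inv[i][j] * sw[j]
               for i in range(len(sw)) for j in range(len(sw)))
    return n_exemplars * quad


# Tiny hand-made inputs (illustrative numbers only): two exemplars, two weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
c_hat_demo = sandwich_covariance([[1.0, 0.0], [0.0, 1.0]], [identity, identity])
stat_demo = weight_screening_statistic([3.0, 4.0], c_hat_demo, [1], 2)
print(c_hat_demo, stat_demo)
```

The `ValueError` raised by `inverse` corresponds to the ill-conditioned case that arises when redundant inputs or middle nodes are present.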

4.3.3 Relationship between Weight Saliency and Weight Screening In this section, a relationship
is shown between the Euclidean weight-based saliency proposed by Tarr (see Equation 22
in Chapter II) and the $\chi^2$ test statistic (on the left hand side of Equation 61) [82, 72]. Let $S_i$ be the
$q$ by $s$ selection matrix which selects the $q$ weights associated with feature $i$. The weight saliency
$\tau_i$ and the $\chi^2$ test statistic associated with the $i$th feature are redefined in terms of the selection
matrix $S_i$ as

$$\tau_i = \hat{w}' S_i' S_i \hat{w},$$

$$P\,\hat{w}' S_i' (S_i C^* S_i')^{-1} S_i \hat{w} \sim \chi^2_q$$

A relationship between $\tau_i$ and the $\chi^2$ test statistic is developed as follows

$$\begin{aligned}
\tau_i &= \hat{w}' S_i' S_i \hat{w} \\
&= \hat{w}' S_i' I_q S_i \hat{w} \\
&= \hat{w}' S_i' (S_i I_s S_i') S_i \hat{w} \\
&= \hat{w}' S_i' (S_i I_s S_i')^{-1} S_i \hat{w}
\end{aligned}$$

Now, by considering the special case where $C^* = I_s$ and substituting $C^*$ for $I_s$, the above gives

$$\tau_i = \hat{w}' S_i' (S_i C^* S_i')^{-1} S_i \hat{w}$$

which is equivalent to

$$P \tau_i = P\,\hat{w}' S_i' (S_i C^* S_i')^{-1} S_i \hat{w}$$

Therefore, the special case where $C^* = I_s$ gives the relationship

$$P \tau_i \sim \chi^2_q$$

This relationship can only be shown for $C^* = I_s$. This fact highlights that weight saliency
does not incorporate the variance covariance structure of the weights emanating from the feature
of interest.
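A small numerical sketch makes the point concrete. It assumes a diagonal covariance matrix, for which the quadratic form reduces to a variance-weighted sum of squared weights; the weights, variances, and function names are hypothetical values chosen for illustration.

```python
def weight_saliency(weights, feature_indices):
    """Euclidean weight saliency: sum of squared weights attached to one feature."""
    return sum(weights[k] ** 2 for k in feature_indices)


def chi_square_statistic_diagonal(weights, variances, feature_indices, n_exemplars):
    """P * w'S'(S C S')^-1 S w for the special case of a diagonal covariance C."""
    return n_exemplars * sum(weights[k] ** 2 / variances[k] for k in feature_indices)


weights = [0.5, -2.0, 1.0, 0.1]   # hypothetical trained weights
feature_a = [0, 1]                # indices of the weights attached to one feature
P = 100                           # hypothetical number of exemplars

# With C = I the statistic is exactly P times the weight saliency.
identity_case = chi_square_statistic_diagonal(weights, [1.0] * 4, feature_a, P)

# If the large weight is also highly variable, the statistic drops sharply,
# while the weight saliency is unchanged.
unequal_case = chi_square_statistic_diagonal(
    weights, [1.0, 16.0, 1.0, 1.0], feature_a, P)
print(identity_case, unequal_case, weight_saliency(weights, feature_a))
```

The second case shows how a feature can have a large weight saliency yet a small test statistic once the variance of its weights is taken into account.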

4.3.4 Methodology. A formalized statistical weight screening procedure for identifying irrelevant
features is based on applying the Bonferroni inequality to $M$ individual hypothesis tests
to achieve a predetermined ‘family’ significance level $\alpha$. For the weight screening procedure, the
individual hypothesis test associated with the $i$th feature is given as

Null Hypothesis $H_0 : S_i w^* = 0$
Alternative Hypothesis $H_a : S_i w^* \neq 0$

The null hypothesis is referred to by White as the ‘irrelevant input hypothesis’ for testing whether
the $i$th feature is irrelevant [82].

The weight screening procedure can be summarized in a similar manner to the saliency
screening procedure as follows:


Weight Screening Procedure

1. Train a neural net to minimize training-test set error.

The net should use a minimal network structure with no redundant feature inputs and no
redundant middle nodes.

The net should be trained to a local minimum.

2. Select the ‘family’ significance level, $\alpha$.

3. For each of the $M$ features, do an individual hypothesis test as follows:

(a) Compute the matrices $\hat{A}$ and $\hat{B}$ in Equations 63 and 64.

(b) Compute $\hat{C}$ in Equation 62.

(c) Form the appropriate selection matrix $S_i$ which picks out the $q$ elements of $w^*$ which are
hypothesized to be zero under $H_0$.

(d) Compute the test statistic $\chi_i^{2*}$ in Equation 65.

(e) Determine the Bonferroni critical value for an individual significance level of $\alpha / M$.

(f) Evaluate the test statistic as follows:

* If $\chi_i^{2*}$ is less than or equal to the Bonferroni critical value, the null hypothesis can not be
rejected for feature $i$.

Conclusion: feature $i$ is irrelevant, since the weights associated with the $i$th feature
are not statistically different from 0 at the $\alpha$ ‘family’ significance level.

* If $\chi_i^{2*}$ is greater than the Bonferroni critical value, reject the null hypothesis for feature $i$.

Conclusion: feature $i$ is relevant, since the weights are statistically different from 0
at the $\alpha$ ‘family’ significance level.

4. Eliminate the irrelevant features and retrain the network using only the relevant features.

If there are any questions concerning the salient features, the procedure can be performed
again on the reduced feature set to confirm the remaining features are all still salient.
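The decision rule in step 3 can be sketched as below. As assumptions for this sketch: the comparison is done with upper-tail probabilities rather than a tabulated critical value (the two are equivalent), and only an even number of degrees of freedom $q$ is handled, because the chi-square tail then has a simple closed form. The function names are hypothetical. The statistics are the XOR averages reported in Table 20, with the individual significance level of .005 listed there.

```python
import math


def chi2_upper_tail_even(x, q):
    """P(X > x) for a chi-square variable with an even number q of degrees of freedom."""
    if q <= 0 or q % 2 != 0:
        raise ValueError("this closed form only covers even, positive q")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(q // 2))


def bonferroni_level(family_alpha, n_tests):
    """Individual significance level that bounds the family level by family_alpha."""
    return family_alpha / n_tests


def screen_features(statistics_by_feature, q, individual_level):
    """Map each feature to True (H0 rejected: relevant) or False (irrelevant)."""
    return {
        name: chi2_upper_tail_even(stat, q) < individual_level
        for name, stat in statistics_by_feature.items()
    }


# Average statistics over 30 runs for the XOR problem (Table 20), with q = 4.
xor_average_statistics = {
    "x": 139.57, "y": 145.78,
    "noise 1": 4.74, "noise 2": 3.70, "noise 3": 4.31,
    "noise 4": 3.89, "noise 5": 5.32,
}
decisions = screen_features(xor_average_statistics, q=4, individual_level=0.005)
relevant = sorted(name for name, keep in decisions.items() if keep)
print(relevant)
```

As a consistency check, the upper-tail probability at 14.86 with $q = 4$ is approximately .005, which matches the critical value listed with Table 20.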

There are two potential problems with practical implementation of the weight screening procedure:

1. Ill-conditioned matrices due to redundant inputs and/or middle nodes.

2. Intractable matrix inversion due to a very large network structure.

When there are redundant inputs or redundant middle nodes, the weights $\hat{w}$ are no longer
multivariate normally distributed; therefore, the $\chi^{2*}$ statistic is no longer distributed as a
$\chi^2_q$ under the null hypothesis. If ill-conditioning is present or suspected, one should proceed with
caution. Ideally, all redundancies should be eliminated in order to perform the irrelevant input
hypothesis test with the $\chi^{2*}$ test statistic. The second potential problem can not always be avoided
or eliminated. The computation of $\chi^{2*}$ may become intractable even when the network is trained
and redundancies in the features and middle nodes have been eliminated. The intractability is
primarily due to the inversion of the $s$-dimensional matrix $\hat{A}$ which is required for estimating the
covariance matrix $C^*$.

The practical limitations on the problem size can be mitigated by making simplifying assumptions
about the covariance structure of the weights. These assumptions will in turn simplify the
process of inverting the Hessian matrix $\hat{A}$. However, improving the practical implementation when
there are size limitations only makes sense if the procedure is useful for consistently identifying
irrelevant features. The performance of the weight screening procedure is evaluated in the next
section.

4.3.5 Analysis In this section, an empirical evaluation of the weight screening procedure is
presented using the XOR, API, and FLIR problems. For all three problems, results are presented
for the same 30 runs which were used with the saliency screening in Section 2. All of the same
training sets, training parameters, and final weight estimates are used.

Recall the XOR problem which was augmented with five additional noise features (all treated
as candidate features). The weight screening hypothesis test results for the XOR problem are
summarized in Table 20. The number of times the null hypothesis was rejected in 30 neural
network runs is reported in the first column of Table 20. For example, the null hypothesis was
rejected for feature x in 25 out of 30 runs. This means that in five of the runs, the vector of weights
connected to feature x was not statistically different from 0. Depending on which run is used, the
‘true’ features x and y may or may not be statistically different from noise. In fact, for one of the
network runs the test statistic from one of the noise features was larger than the test statistic from
one of the ‘true’ features. The average $\chi^{2*}$ statistics over 30 runs are reported in the second
column of Table 20. While the results shown in column one indicate what might occur on any
given run, the results shown in column two indicate what can be gleaned if the information from
all 30 runs is combined. On this problem, the average test statistics indicate the features x and y
are relevant, and the noise features are irrelevant.

Table 20. Weight Screening Results on XOR Problem for 30 Runs

Feature   H0 Rejections   Average
          out of 30       $\chi^{2*}$
x         25              139.57
y         22              145.78
noise 1   3               4.74
noise 2   0               3.70
noise 3   1               4.31
noise 4   0               3.89
noise 5   3               5.32

Individual Significance Level .005
Critical Value $\chi^2_q$ = 14.86, where q = 4

The second problem used is the API problem which is augmented with two additional noise
features (both treated as candidate features). The weight screening hypothesis test results are
summarized in Table 21. The number of times the null hypothesis was rejected in 30 neural network
runs is reported in the first column of Table 21. On any individual run, the weight screening results
may or may not be consistent with the saliency screening results. That is, the noise features and
Ms may or may not be identified as irrelevant and the other ‘salient’ features may or may not be
identified as irrelevant. The average $\chi^{2*}$ test statistics over 30 runs are reported in the second
column of Table 21. On this problem, the average test statistics indicate that only the noise features
are irrelevant. The feature Ms probably warrants further investigation, since its test statistic is fairly
close to the critical value.

Table 21. Weight Screening Results on API Problem for 30 Runs

Feature   H0 Rejections   Average
          out of 30       $\chi^{2*}$
PLY       29              126.3
Vs        25              51.8
Ms        14              20.5
SECT      29              106.67
noise 1   6               11.21
noise 2   11              15.61

Individual Significance Level .005
Critical Value $\chi^2_q$ = 16.75, where q = 5

The third problem used is the FLIR problem which was augmented with two additional noise
features (both treated as candidate features). The weight screening hypothesis test results are
summarized in Table 22. The number of times the null hypothesis was rejected over 30 neural
network runs is reported in the first column of Table 22. These results indicate that none of the
features are consistently relevant. In fact, on several of the 30 simulations none of the features have
weights which are statistically relevant (i.e. weights that are statistically different from 0). The
average $\chi^{2*}$ test statistics over 30 runs are reported in the second column of Table 22. For this
problem, the average test statistics indicate that only three features are relevant. The fifth feature,
which has the largest average test statistic, also has the largest saliency screening test statistic in
Table 19. However, feature two, which has the second largest saliency test statistic in Table 19, was
identified as relevant only seven times out of the thirty runs using the weight screening procedure.

On  all  of  these  problems,  the  weight  screening  methodology  did  not  provide  consistent  reliable 
results  for  the  30  runs  evaluated.  For  the  first  two  problems,  noise  was  not  consistently  identified 
as  irrelevant.  Using  the  weight  screening  procedure  resulted  in  identifying  some  or  even  all  of  the 
true  variables  as  irrelevant  on  some  individual  runs. 

Although the weight screening procedure is meant to be used with just a single neural network
run, the average test statistics over 30 runs have more potential as a screening tool. For the XOR
and API problems, the average test statistics indicate that all of the features are relevant except for
noise. However, when using the average test statistics for the FLIR problem, none of the features
are relevant except feature five.

Table 22. Weight Screening Results on FLIR Problem for 30 Runs

Feature   H0 Rejections   Average
          out of 30       $\chi^{2*}$
1         4               10.6
2         7               10.3
3         1               2.0
4         12              14.8
5         23              39.8
6         5               7.2
7         11              15.3
8         5               8.4
9         0               2.5
noise 1   1               3.2
noise 2   1               3.1

Individual Significance Level .005
Critical Value $\chi^2_q$ = 14.86, where q = 4

4.3.6 Summary The weight screening procedure, while a single run procedure, is not robust.
It does not provide consistently reliable results for a single run. This indicates the procedure is
extremely sensitive to less than ideal conditions where the weights may not be multivariate normally
distributed in the limit or may not be from a perfectly trained network. Further, the weight
screening procedure is computationally impractical for reasonably sized networks due to the required
Hessian matrix inversion.

In  the  identification  of  noise,  the  average  test  statistics  over  several  runs  are  more  useful  than 
the  test  statistics  from  a  single  run.  However,  in  the  case  of  the  FLIR  problem,  only  one  feature  is 
identified as relevant. As with the saliency screening procedure, a visual comparison of the average
test  statistics  might  be  useful  to  categorize  features  as  ‘relevant’  and  ‘irrelevant’  for  some  problems. 

4.4 Summary

Two statistically based feature screening procedures are discussed in this chapter. The two
statistically based methods for identifying nonsalient/irrelevant features are the saliency screening
procedure and the weight screening procedure. In the saliency screening procedure, a feature’s
mean  saliency  is  statistically  compared  to  the  mean  saliency  of  a  known  noise  feature.  In  the  weight 
screening  procedure,  the  weights  emanating  from  a  feature  are  tested  to  see  if  they  are  statistically 
different  than  0.  With  either  of  these  statistical  screening  procedures,  features  can  be  ranked  and 
categorized  as  ‘essential’  and  ‘nonessential’  based  on  their  test  statistics. 

It  is  desirable  for  a  feature  screening  procedure  to  be  able  to  consistently  identify  noise  or 
irrelevant  features  as  different  from  other  features.  An  evaluation  of  both  procedures  demonstrates 
that  the  saliency  screening  procedure  is  more  robust  than  the  weight  screening  procedure,  since 
it consistently identifies noise-like features as being nonsalient/irrelevant whereas the weight
screening procedure does not. Additionally, disadvantages are associated with the weight screening
procedure which are not associated with the saliency screening procedure. A summary of the
disadvantages follows.

•  The procedure requires the weight parameters to have a multivariate normal limiting distribution,
which is often violated in the practical implementation of neural networks.

•  The  procedure  relies  on  a  weakly  consistent  estimate  of  the  limiting  covariance  matrix  of  the 
weights. 

•  The  procedure  for  estimating  the  limiting  covariance  matrix  of  the  weights  requires  the  second 
derivative  information  from  the  Hessian  matrix. 

•  The  procedure  has  practical  limitations  on  the  problem  size,  since  the  Hessian  matrix  must 
be  inverted. 

•  The  irrelevant  noise  features  are  not  consistently  identified  with  this  procedure. 

The  statistical  saliency  screening  procedure  presented  in  this  chapter  is  useful  for  identifying 
and  eliminating  noisy  features  from  a  set  of  candidate  features  up  front  prior  to  using  a  feature 
selection  procedure  for  investigating  reduced  feature  subsets.  In  the  next  chapter,  two  novel  neural 
network  selection  algorithms  are  presented  for  investigating  reduced  feature  subsets  and  appropriate 
neural network architecture.

V. Selection Algorithms for Neural Networks


5.1  Introduction 

In this chapter, a statistical model selection perspective is adopted for neural networks. Algorithms
are developed for automating the process of model selection for feedforward neural networks.
Both algorithms are developed based on nonlinear regression model selection criteria within a backwards
sequential procedure. The first is an initial architecture selection algorithm for determining
an  appropriate  number  of  middle  nodes.  The  second  is  a  feature  selection  algorithm  for  reducing 
feature  sets  which  may  be  either  larger  than  necessary,  or  larger  than  desired.  An  overview  of  the 
chapter  follows. 

Univariate response nonlinear regression, model selection criteria, and practical considerations
for neural network model selection are reviewed in Section 2. In Section 3, a backwards
sequential  selection  algorithm,  and  neural  network  algorithms  for  architecture  and  feature  selection 
are  presented.  The  results  in  this  chapter  are  summarized  in  Section  4. 

5.2  Univariate  Nonlinear  Regression  and  Model  Selection 

5.2.1 Introduction. “Backpropagation and nonlinear regression can be viewed as alternative
statistical approaches to solving the least squares problem” [80:85]. The major difference between
instantaneous backpropagation and least squares nonlinear regression is that in least squares nonlinear
regression, the weight parameters are determined with techniques which use “batched” updates
rather than “instantaneous” updates which use one data point at a time. In this section, a single
output neural network trained with backpropagation is formalized by adopting the framework of
a univariate nonlinear least squares regression model [80:85-89]. Using this framework, nonlinear
regression statistical model selection criteria are defined. Practical considerations for neural network
model selection are discussed.

5.2.2 The Neural Network Nonlinear Regression Model. Consider a single output neural
network as a univariate response nonlinear regression model.

$$d = z(X, w^*) + e \tag{66}$$

where $d$ is the $P \times 1$ vector of true ‘desired’ network outputs; $z(X, w)$ can be interpreted as $E(d \mid X)$,
the $P \times 1$ vector of neural network responses conditioned on the $P \times M$ matrix of feature input
variables $X$; $w^*$ is the $s \times 1$ vector of unknown optimal weight parameters; and $e$ is the $P \times 1$ vector
of neural network errors. The least squares estimator of $w^*$ is the $s \times 1$ vector $\hat{w}$ that minimizes
the neural network sum of squared errors (SSE) with respect to $w$.

$$\hat{w} = \arg\min_{w} \{ \mathrm{SSE}(w) \},$$

where

$$\mathrm{SSE}(w) = [d - z(X, w)]' [d - z(X, w)] \tag{67}$$

An estimate of the variance of the errors $e$ corresponding to the least squares estimator $\hat{w}$ is

$$\hat{\sigma}^2 = \frac{\mathrm{SSE}(\hat{w})}{P - s}$$

When there are no isolated flats in the parameter space caused by redundant inputs or
middle nodes, the unknown optimal weight vector $w^*$ is said to be locally unique [80:106]. The
neural network model is said to be locally identifiable, that is $w^*$ is a unique local solution, and
the estimated weight vector $\hat{w}$ has a multivariate normal distribution [80:106-7].

A discussion of conditional probability laws as they pertain to the interpretation of neural
networks can be found in White [80:95]. In the discussion, White recognizes the special case of the
binary dependent variable $d$ associated with any two-way classification problem [80:95]. Minimizing
the squared errors with backpropagation produces a minimum mean-squared error estimator $\hat{w}$
and a minimum mean-squared error approximation to $E(d \mid X)$ [80:98]. By the definition of
$e = d - z(X, w^*)$ and the properties of conditional expectation, the expectation of $e$ conditioned on $X$
is zero [82:430]. With the added assumption of normally distributed errors, least squares nonlinear
regression is equivalent to the method of maximum likelihood nonlinear regression [80:259].

5.2.3 Three Selection Criteria. There are three model selection criteria for correctly specified
nonlinear regression models which are associated with normally distributed errors. They are
the Wald, Lagrange multiplier, and likelihood ratio tests. Taken in the context of neural network
model selection, all of these tests can be defined with the same general hypothesis test:

Null Hypothesis $H_0 : S w^* = 0$
Alternative Hypothesis $H_a : S w^* \neq 0$,


where $w^*$ is an $s \times 1$ vector of neural network optimal weight parameters and $S$ is a $q \times s$ selection
matrix picking out the $q$ elements of $w^*$ which are hypothesized to be zero under $H_0$.

All three tests have an associated model selection criterion, sometimes referred to as a test
statistic. These are denoted as: $W$ for the Wald test statistic, $R_1$ or $R_2$ for the Lagrange multiplier
test statistics, and $L$ for the likelihood ratio test statistic. In this section, these test statistics are
defined in neural network terminology.

The Wald test statistic is adapted from Gallant [20:48], and White [80:109].

$$W = \frac{\hat{w}' S' \left( S (\hat{F}' \hat{F})^{-1} S' \right)^{-1} S \hat{w} \,/\, q}{\mathrm{SSE} / (P - s)} \tag{68}$$

where $\hat{F} = F(\hat{w})$ and where $F$ is defined

$$F(w) = \frac{\partial}{\partial w'} z(X, w) \tag{69}$$

and SSE is the sum of squared-errors defined in Equation 67 which is evaluated for the full model.
Under the null hypothesis, $W$ in Equation 68 is approximately $F$ distributed with $q$ numerator and
$P - s$ denominator degrees of freedom [20:48].

For the Lagrange multiplier test statistics, the SSE is minimized for the full model subject to
the null hypothesis. This is a constrained minimization where the resulting estimator is denoted
$\tilde{w}$. The unconstrained minimization of SSE results in the estimator $\hat{w}$. Using the constrained
estimator $\tilde{w}$ as a starting value, one ‘unconstrained’ Gauss-Newton step is taken towards $\hat{w}$. In
neural network terms, a Gauss-Newton step is equivalent to one epoch of instantaneous updates or
one batch update. The Gauss-Newton step, denoted $D$, is defined in Gallant [20:85] as

$$D = (\tilde{F}' \tilde{F})^{-1} \tilde{F}' [d - z(X, \tilde{w})],$$

where $\tilde{F} = F(\tilde{w})$, similar to Equation 69. If $H_0$ is true, then $D$ will be small. Conversely, if $H_0$ is
false, then $D$ will be large.

There are two versions of the Lagrange multiplier test statistics, $R_1$ and $R_2$. Each is based
on a measure of $D$. The definitions of $R_1$ and $R_2$ are adapted from Gallant [20:86].

$$R_1 = \frac{D' (\tilde{F}' \tilde{F}) D \,/\, q}{\mathrm{SSE}(\tilde{w} + D) / (P - s)}$$

$$R_2 = \frac{P\, D' (\tilde{F}' \tilde{F}) D}{\mathrm{SSE}(\tilde{w})},$$

where $\mathrm{SSE}(\tilde{w})$ represents the constrained minimum sum of squared errors associated with fitting the
full model subject to $H_0$, and $\mathrm{SSE}(\tilde{w} + D)$ represents the sum of squared errors after taking
one unconstrained Gauss-Newton step. When using the first version of the Lagrange multiplier
test statistic, the null hypothesis is rejected when $R_1$ exceeds the upper $\alpha \times 100$ critical point of the
$F$-distribution, denoted $F_{\alpha;\,q,\,P-s}$. When using the second version of the Lagrange multiplier test
statistic, the null hypothesis is rejected when $R_2$ exceeds $d_\alpha$ where

$$d_\alpha = \frac{P\, F_\alpha}{(P - s)/q + F_\alpha}$$

The likelihood ratio test involves fitting two nonlinear regression models. The full model is fit under the alternative hypothesis, and the reduced model is fit under the null hypothesis. The full model has $P - s$ degrees of freedom, and the reduced model has $P - (s - q)$ degrees of freedom, where $q$ is the number of weight parameters constrained to be zero in the reduced model. The associated likelihood ratio test statistic, adapted from Gallant [20:56], is

$$L = \frac{(\mathrm{SSE}_R - \mathrm{SSE}_F)/q}{\mathrm{SSE}_F/(P-s)}, \tag{70}$$

where $\mathrm{SSE}_F$ and $\mathrm{SSE}_R$ are the full and reduced models’ sums of squared errors. Under $H_0$, $L$ is F-distributed with $q$ numerator and $P - s$ denominator degrees of freedom. Therefore, whenever $L$ exceeds the upper $\alpha \times 100\%$ critical point of the F-distribution, the null hypothesis is rejected.
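
As a minimal sketch of the likelihood ratio statistic above, the function below computes L from the two sums of squared errors. The function and argument names are illustrative and do not come from the source, the numeric values in the usage lines are hypothetical, and the critical value is assumed to be looked up separately (for example, from an F table).

```python
def likelihood_ratio_statistic(sse_full, sse_reduced, n_exemplars, n_weights, n_constrained):
    """Return L = [(SSE_R - SSE_F) / q] / [SSE_F / (P - s)].

    n_exemplars is P, n_weights is s (full model), n_constrained is q.
    """
    df_full = n_exemplars - n_weights  # denominator degrees of freedom, P - s
    if df_full <= 0 or n_constrained <= 0:
        raise ValueError("need P - s > 0 and q > 0")
    if sse_full <= 0:
        raise ValueError("full-model SSE must be positive")
    return ((sse_reduced - sse_full) / n_constrained) / (sse_full / df_full)


def reject_reduced_model(l_stat, f_critical):
    """Reject H0 (the reduced model) when L exceeds the F critical value."""
    return l_stat > f_critical


# Hypothetical numbers: P = 281 exemplars, s = 49 weights, q = 6 weights set to zero.
example_l = likelihood_ratio_statistic(sse_full=4.0, sse_reduced=5.0,
                                       n_exemplars=281, n_weights=49, n_constrained=6)
```

Keeping the critical value as an input separates the arithmetic of the statistic from the choice of significance level, which the text treats as a judgment made by the practitioner.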

5.2.4 Specification Robust Selection Criteria. The neural network model is generally an incorrectly specified probability model, and normally distributed errors cannot be assumed. White “explores the ramifications of maximizing a general likelihood function that is not derived from a correctly specified probability model - precisely the situation under which network learning generally occurs” [80:259-288].

For situations where the errors are non-normal or the model is not correctly specified, White describes specification robust versions of the Wald and Lagrange multiplier tests which are asymptotically distributed as $\chi^2$ [80:263-267]. The robust version of the Wald test statistic for neural network model selection is identical to the weight screening test statistic presented in Chapter IV [80:109,266]. The robust versions of both of these test statistics are given in White [80:266-267]. Under misspecification, the likelihood ratio statistic is, generally, not equivalent to the Wald or Lagrange multiplier test statistic, nor is it asymptotically distributed as $\chi^2$ (or as $F$, in the version of the test statistic considered by Gallant) [80:267], [20:591].

The primary difference between the specification robust and non-specification robust versions of the Wald and Lagrange multiplier test statistics is that the weight parameter covariance matrix is no longer approximated by $(\hat{F}'\hat{F})^{-1}$ scaled by the error variance estimate $\mathrm{SSE}/(P-s)$, as in Equation 68. Instead, the matrix is approximated by $C^*$. An analytical expression for $C^*$ is not available, but an estimator $\hat{C}$ exists which is weakly consistent. The estimator is defined as

$$\hat{C} = \hat{A}^{-1}\hat{B}\hat{A}^{-1} \tag{71}$$

The matrices $\hat{A}$ and $\hat{B}$ are defined as

$$\hat{A} = P^{-1}\sum_{i=1}^{P}\nabla^2 \varepsilon(x_i, \hat{w})$$

$$\hat{B} = P^{-1}\sum_{i=1}^{P}\nabla \varepsilon(x_i, \hat{w})\,\nabla \varepsilon(x_i, \hat{w})'$$

where $P$ is the sample size; $\varepsilon$ is the neural network error used for training; the operator $\nabla$ is the $s \times 1$ gradient of $\varepsilon$ with respect to $w$; the operator $\nabla^2$ is the $s \times s$ Hessian of $\varepsilon$ with respect to $w$; and $x_i$ is the $i$-th input exemplar. Replacing $C^*$ with $\hat{C}$ does not affect the limiting $\chi^2$ distribution; however, White warns that a much larger sample size may be required to obtain a good approximation of $C^*$ [79:442-443].

Some disadvantages associated with the specification robust Wald statistic are documented for its application to neural network feature screening in Chapter IV. They are reviewed here for completeness:

• The procedure requires the weight parameters to have a multivariate normal limiting distribution. Due to redundant inputs and middle nodes, this is often violated in the practical implementation of neural networks.

• The procedure relies on a weakly consistent estimate of the limiting covariance matrix of the weights.

• The procedure for estimating the limiting covariance matrix of the weights requires second derivative information from the Hessian matrix.

• The procedure has practical limitations on the problem size since the Hessian matrix must be inverted.

•  Irrelevant  noise  features  are  not  consistently  identified  with  this  procedure  (see  results  shown 
in  Chapter  IV,  Section  3.5). 

For the specification robust version of the Lagrange multiplier test, these disadvantages remain. However, no decisive statement can be made regarding the last disadvantage, since use of the Lagrange multiplier test for neural network model selection has not been documented.

5.2.5 Practical Considerations for Neural Network Selection. For practical implementation by a neural network practitioner, the specification robust versions of model selection criteria are associated with computationally burdensome test statistics and restrictive assumptions which are often compromised in a model selection scenario.

The required estimation of the covariance matrix of the weights involves the formation and subsequent inversion of the (possibly ill-conditioned) Hessian matrix. This is computationally burdensome, and puts a practical limitation on the size of the network to be analyzed.

When there are no redundancies in either the network structure or feature inputs, the required normality assumption is satisfied [80:105]. White assumes control can be exercised over the limiting multivariate distribution of the weight parameters and that the neural network practitioner has expended the effort necessary to remove feature and middle node redundancies [80:106-108]. However, redundancies are not always easy to detect or remedy. This fact has been well documented for the phenomenon of multicollinearity in linear regression [46]. Further, the specification robust tests are not appropriate for identifying unnecessary redundant parameters [80:109,145].

Consider  a  neural  network  feature  selection  scenario  where  irrelevant  noise  features  have 
been  identified  and  removed  through  the  feature  screening  procedures  defined  in  Chapter  IV.  In 
this  feature  selection  scenario,  any  candidate  features  which  should  be  eliminated  are  probably 
redundant.  This  is  precisely  the  scenario  where  the  specification  robust  test  statistics  have  problems, 
since  the  asymptotic  normality  assumptions  do  not  hold  in  the  presence  of  redundancies. 

Since the advantage of specification robust test statistics is generally compromised in a feature selection scenario, the “non-specification robust” test statistics are considered further. Of the three “non-specification robust” test statistics discussed in Section 2, the likelihood ratio test is the most appealing for practical use. It is appealing for the following reasons:

• The likelihood ratio test is not affected by parameter effects curvatures which are associated with the linearized approximation of $z(X, w)$ used for the Wald statistic [66:200-220].

• The likelihood ratio test is more powerful than either version of the Lagrange multiplier test statistic [20:89].

• The likelihood ratio test statistic is the only test statistic which does not require estimating and subsequently inverting the covariance matrix of the weight parameters.

• The likelihood ratio test statistic is based on comparing minimum sums of squared errors, which is the basis for parameter estimation in standard backpropagation training.

• The likelihood ratio test statistic is prominently used in linear regression model selection algorithms.

5.3 Neural Network Selection Algorithms


5.3.1 Introduction. Two neural network model selection algorithms are proposed in this section: the initial architecture selection algorithm and the feature selection algorithm. Both algorithms are developed by adopting a statistical model building perspective using a backwards sequential procedure. For completeness, the backwards sequential selection algorithm will be reviewed before it is used with the model selection algorithms. The backwards sequential selection algorithm is proposed in this research because potentially better features may be selected when correlated features are present [43].

Due  to  the  practical  considerations  discussed  in  Section  2,  the  likelihood  ratio  test  statistic 
is  used  in  this  research.  Since  neural  network  models  do  not  generally  have  normally  distributed 
errors,  the  likelihood  ratio  test  statistic  L  in  Equation  70  will  not  be  exactly  F-distributed.  The 
assumption  is  that  L  will  be  approximately  F-distributed,  and  that  the  relative  magnitudes  of  the 
likelihood ratio test statistics are more palatable if one applies the F distribution at some appropriately
conservative  significance  level.  The  assumptions  about  the  approximate  distribution  of  L  give  the 
proposed  selection  algorithms  a  heuristic  flavor. 

Neural networks trained with backpropagation can get stuck at local minima or saddle points, or they can diverge [80:143]. White recommends several neural networks be trained, and the network which has converged to the smallest error be used [80:143]. To be more specific, several nets should be trained using a different random initialization of the weight parameters and the order of presentation. When multiple neural networks are used, the minimum error network will usually yield consistent parameter estimates for some local minimum, but there is still no guarantee of being close to a global minimum [80:143]. Using multiple neural networks helps to minimize the probability of accepting a network which has not converged to a good local minimum. For this reason, both of the neural network selection algorithms proposed in this section use multiple neural networks. From among the neural networks, a best network is selected based on the minimum total SSE of the training and test sets. The number of neural networks one can afford in any given situation is usually limited by the time and computing resources available. A reasonable number of neural networks, five or ten, is usually sufficient for finding a good local minimum.
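
The multiple-network strategy described above can be sketched as follows. This is only an illustration: `train_once` is a hypothetical caller-supplied function standing in for one backpropagation run, and `toy_train_once` is a placeholder that returns a pseudo-random SSE instead of training a network.

```python
import random


def best_of_runs(train_once, n_runs, seed=0):
    """Train n_runs realizations and keep the one with the smallest SSE.

    train_once(run_seed) is a caller-supplied (hypothetical) function that
    trains one network from a fresh random initialization and presentation
    order and returns its total SSE over the training and test sets.
    """
    rng = random.Random(seed)
    sses = [train_once(rng.randrange(2**31)) for _ in range(n_runs)]
    best_run = min(range(n_runs), key=lambda i: sses[i])
    return sses[best_run], best_run, sses


def toy_train_once(run_seed):
    """Stand-in for real training: returns a pseudo-random SSE per seed."""
    return 4.0 + 3.0 * random.Random(run_seed).random()


best_sse, best_run, all_sses = best_of_runs(toy_train_once, n_runs=10)
```

Returning all of the SSE values alongside the minimum makes it easy to inspect how widely the local minima vary from run to run.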

5.3.2 Backwards Sequential Selection. The likelihood ratio test statistic can be used within a sequential search algorithm for model selection. The backwards sequential selection algorithm is proposed in this research because potentially better features may be selected when correlated features are present [43]. For completeness, a backwards sequential feature selection algorithm is outlined.

Backward Sequential Selection

1. Set $k = M$, where $M$ is the total number of candidate features.

2. Choose a significance level $\alpha$ for feature elimination.

3. Estimate the full model with $k$ candidate features. Associated with this full model is $\mathrm{SSE}_F$ for univariate models and $T_F$ for multivariate response models.

4. Set $p = k - 1$, where $p$ is the number of features in the reduced model.

5. Estimate all models with $p$ features which do not include any previously eliminated features. Associated with each of the $k$ reduced models is $\mathrm{SSE}_R$ for univariate models and $T_R$ for multivariate response models.

6. Compute the likelihood ratio test statistic using Equation 70 for each of the $k$ models of $p$ features.

7. Select as a candidate feature for elimination the feature which, when removed, produces the model with the lowest likelihood ratio test statistic $L$.

8. If $L < F_\alpha$ (for appropriate numerator and denominator degrees of freedom), eliminate the candidate feature, set $k = k - 1$, and go to Step 3. Otherwise, go to Step 9 and do not eliminate the candidate feature.

9. Stop.
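
The steps above can be sketched as a short loop. This is a hedged illustration for the univariate (SSE) case only: `fit_sse`, `n_params`, and `f_critical` are hypothetical caller-supplied interfaces, the toy SSE function and the fixed critical value of 4.0 are placeholders, and stopping once a single feature remains is an added assumption that the outline does not state.

```python
def backward_selection(features, fit_sse, n_params, n_exemplars, f_critical):
    """Backward sequential selection driven by the likelihood ratio statistic.

    fit_sse(subset)    -> minimum SSE of the model estimated on `subset`
    n_params(k)        -> number of weight parameters s for a k-feature model
    f_critical(q, df)  -> F critical value at the chosen significance level
    All three callables are supplied by the caller (hypothetical interfaces).
    """
    current = list(features)
    while len(current) > 1:                      # assumption: keep at least one feature
        k = len(current)
        sse_full = fit_sse(current)              # Step 3
        df_full = n_exemplars - n_params(k)
        q = n_params(k) - n_params(k - 1)        # parameters removed with one feature
        best = None
        for feature in current:                  # Steps 5-6
            reduced = [f for f in current if f != feature]
            l_stat = ((fit_sse(reduced) - sse_full) / q) / (sse_full / df_full)
            if best is None or l_stat < best[0]:
                best = (l_stat, feature)         # Step 7: lowest L so far
        if best[0] < f_critical(q, df_full):     # Step 8
            current.remove(best[1])
        else:
            break                                # Step 9
    return current


RELEVANT = {"x1", "x2"}


def toy_fit_sse(subset):
    """Placeholder fit: SSE jumps whenever a relevant feature is missing."""
    return 1.0 + 50.0 * len(RELEVANT - set(subset))


selected = backward_selection(
    ["x1", "x2", "noise1", "noise2"], toy_fit_sse,
    n_params=lambda k: (k + 1) * 3 + (3 + 1),    # s for H = 3 middle nodes
    n_exemplars=100,
    f_critical=lambda q, df: 4.0)                # fixed placeholder critical value
```

In this toy setting the two noise features are removed one at a time, and the loop stops when removing either remaining feature would inflate the SSE.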
5.3.3 Architecture Selection Algorithm. In practice, the neural network practitioner does architecture selection through a trial and error process using an informal set of “pilot runs.” The pilot runs consist of a series of neural networks trained with a varying number of middle nodes. Quite often, the practitioner selects the architecture associated with the smallest network structure which can be used for accurate prediction. However, no specific criteria are universally accepted for automatically identifying a good initial network structure.

In this section, a formal architecture selection algorithm is proposed for a univariate response neural network. The algorithm uses the concepts of statistical model selection described in Section 2 to evaluate an organized series of pilot runs. Reduced neural network architectures are systematically investigated. At each step of the algorithm, a likelihood ratio test statistic is evaluated.

Figure 16 depicts the neural network architecture algorithm for a univariate response neural network. A description of the algorithm follows.

Figure 16. Neural Network Architecture Selection Algorithm

Neural Network Architecture Selection Algorithm

1. Initialize and define parameters.

• Number of middle nodes: $H$.
• Significance level for statistical testing of reduced structure models: $\alpha$.
• Number of neural network runs: $I$.
• Number of network features: $M$.
• Number of training set exemplars: $P_{tr}$.
• Number of training-test set exemplars: $P_{ts}$.
• Total number of exemplars: $P = P_{tr} + P_{ts}$.
• Number of network weight parameters: $s = (M + 1)H + (H + 1)$.

2. Train $I$ realizations of the full neural network model with $H$ middle nodes.

• Compute the SSE for $P$ exemplars using Equation 67.
• Define $\mathrm{SSE}_{F_i}$ = SSE for the $i$-th neural network realization.
• Define $\mathrm{SSE}_F = \min\{\mathrm{SSE}_{F_1}, \ldots, \mathrm{SSE}_{F_I}\}$.

3. Train $I$ realizations of the reduced neural network model with $H - 1$ middle nodes.

• Compute the SSE for $P$ exemplars using Equation 67.
• Define $\mathrm{SSE}_{R_i}$ = SSE for the $i$-th neural network realization.
• Define $\mathrm{SSE}_R = \min\{\mathrm{SSE}_{R_1}, \ldots, \mathrm{SSE}_{R_I}\}$.

4. Compute degrees of freedom.

• Full model: $df_F = P - s$.
• Reduced model: $df_R = P - [s - (M + 2)]$.

5. Compute the likelihood ratio test statistic $L$:

$$L = \frac{[\mathrm{SSE}_R - \mathrm{SSE}_F]/(df_R - df_F)}{\mathrm{SSE}_F/df_F}$$

6. Test the null hypothesis that the reduced model is equivalent to the full model using $L$ defined in Equation 70.

• If $L < F_{\alpha,\,df_R - df_F,\,df_F}$, the reduced model cannot be rejected.
  - Accept the reduced structure model.
  - $H = H - 1$
  - $s = s - [M + 2]$
  - Go to Step 2.

• If $L \ge F_{\alpha,\,df_R - df_F,\,df_F}$, then reject the reduced model.
  - Go to Step 7.

7. Stop; an appropriate network structure has been determined with $H$ middle nodes.
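
The loop in Steps 2 through 7 can be sketched as below. The sketch is hedged: `train_min_sse` (the minimum SSE over I realizations for a given number of middle nodes) and `f_critical` are hypothetical caller-supplied functions, the toy SSE values and the constant critical value are placeholders, and stopping at one middle node is an added assumption. The reduced model's degrees of freedom are computed directly from the parameter count rather than from an incremental update.

```python
def n_weights(n_features, n_hidden):
    """s = (M + 1)H + (H + 1) for a single-output network with bias terms."""
    return (n_features + 1) * n_hidden + (n_hidden + 1)


def select_architecture(train_min_sse, n_features, n_hidden, n_exemplars, f_critical):
    """Remove middle nodes one at a time while the reduced model is not rejected.

    train_min_sse(H)          -> smallest SSE over I realizations with H middle nodes
    f_critical(df_num, df_den) -> F critical value at the chosen significance level
    """
    history = []
    while n_hidden > 1:                                   # assumption: keep >= 1 node
        sse_full = train_min_sse(n_hidden)                # Step 2
        sse_red = train_min_sse(n_hidden - 1)             # Step 3
        df_full = n_exemplars - n_weights(n_features, n_hidden)      # Step 4
        df_red = n_exemplars - n_weights(n_features, n_hidden - 1)
        l_stat = ((sse_red - sse_full) / (df_red - df_full)) / (sse_full / df_full)  # Step 5
        accept = l_stat < f_critical(df_red - df_full, df_full)      # Step 6
        history.append((n_hidden, l_stat, accept))
        if not accept:
            break                                         # Step 7
        n_hidden -= 1
    return n_hidden, history


# Placeholder training results: the SSE degrades sharply below six middle nodes.
toy_sse = lambda h: 5.0 if h >= 6 else 50.0
final_h, log = select_architecture(toy_sse, n_features=4, n_hidden=8,
                                   n_exemplars=281, f_critical=lambda a, b: 2.0)
```

Because a negative L is always below a positive critical value, the comparison in Step 6 also accepts the reduced model when it happens to fit better than the full model.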

In Step 1, parameters are initialized and defined. A reasonable number of neural networks is usually five or ten. When trying to find good or improved local minima, there is a point of diminishing returns for the computational cost of running additional networks.

In Steps 2 and 3, each training realization involves a different training and training-test set which is randomly selected from among the total pool of exemplars. Randomly selected training and training-test sets mean that each neural network run is trained using a unique partitioning of the data. A typical proportion for splitting the training and training-test sets is a 2/3, 1/3 split [76:30]. With smaller data sets, it is common to use a larger portion of the data in the training set [76:30].

To  fairly  compare  results,  the  SSE  is  computed  in  Steps  2  and  3  using  all  P  exemplars,  so 
the  full  and  reduced  models  are  compared  using  the  same  data  set.  If  there  is  sufficient  data,  a 
validation  set  SSE  can  be  used  to  provide  the  most  unbiased  comparison  between  models. 

A finite number of full and reduced model networks are trained. From this, the networks corresponding to the best full and reduced model local minima are used to form the likelihood ratio in Step 5. In this scenario, where only a finite number of networks are trained, it is possible that the reduced model may produce a lower total SSE than the full model. If the likelihood ratio test statistic is negative in Step 6, the reduced model should be accepted, since the best reduced model from $I$ runs is better than the best full model from $I$ runs.

The architecture selection algorithm is generally initialized with more than enough middle nodes (i.e., some middle nodes are redundant). Redundant middle nodes in a neural network’s structure cause flat regions in the parameter space, which may cause premature convergence to undesirable local minima. Multiple runs are used to avoid using neural networks which have not converged to a good local minimum.

Some prior knowledge about the complexity of the problem at hand is useful for setting a reasonable number of initial middle nodes for the architecture selection algorithm. The number of middle nodes needs to be large enough to ensure there is sufficient complexity for accurate prediction on a test data set. However, an excessive number of middle nodes will unnecessarily increase the computational cost, and the possibility of accepting a model with more middle nodes than necessary.

One way to determine the maximum number of middle nodes for a given situation is to examine the degrees of freedom for the full model, $df_F$, in Step 4. Since $df_F = P - s$ must be greater than zero and $s = (M + 1)H + (H + 1) = (M + 2)H + 1$, an upper bound for $H$ can easily be derived as

$$H < \frac{P - 1}{M + 2} \tag{72}$$

where $P$ is the number of training exemplars, and $M$ is the number of feature inputs not including the bias term. This upper bound is much greater than the more analytically based upper bound developed by Cover [10].

Cover’s rule is based on the separating capacities of families of nonlinear decision surfaces [10]. Cover shows that a family of surfaces having $s$ degrees of freedom has a natural separating capacity for $2s$ training exemplars [10]. Therefore, unless the number of training exemplars is greater than the separating capacity of $2s$, there is a large probability of ambiguous generalization [10]. Requiring $P > 2s$ with $s = (M + 2)H + 1$, Cover’s theorem translates into a more restrictive number of middle nodes:

$$H < \frac{0.5P - 1}{M + 2} \tag{73}$$


It is possible that when the number of candidate features is large and the number of exemplars is small, Cover’s rule may indicate fewer middle nodes than are required for the complexity of the problem at hand (a situation where feature reduction is necessary). In this case, one should proceed with caution since the neural network’s ability to generalize to unknown data can be easily compromised.
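
As a hedged numerical check, the two limits on H can be computed directly from the conditions stated above ($df_F = P - s > 0$, and $P > 2s$ for Cover’s rule) using the parameter count $s = (M + 1)H + (H + 1)$, rather than from a closed-form expression. The function names are illustrative, and the values P = 281 and M = 4 are borrowed from the application in Section 5.3.4 purely as an example.

```python
def n_weights(n_features, n_hidden):
    """s = (M + 1)H + (H + 1): hidden-layer weights plus output-layer weights."""
    return (n_features + 1) * n_hidden + (n_hidden + 1)


def max_middle_nodes(n_exemplars, n_features, capacity_factor=1):
    """Largest H such that capacity_factor * s(H) < P (0 if no H qualifies).

    capacity_factor=1 enforces df_F = P - s > 0;
    capacity_factor=2 enforces Cover's condition P > 2s.
    """
    n_hidden = 0
    while capacity_factor * n_weights(n_features, n_hidden + 1) < n_exemplars:
        n_hidden += 1
    return n_hidden


limit_df = max_middle_nodes(281, 4)                         # degrees-of-freedom limit
limit_cover = max_middle_nodes(281, 4, capacity_factor=2)   # Cover's rule limit
```

For these example values the degrees-of-freedom condition allows up to 46 middle nodes, while Cover’s condition allows up to 23, which illustrates how much more restrictive the second rule is.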

There  are  situations  where  the  required  complexity  is  unclear,  and  the  practitioner  is  not 
constrained  by  a  small  data  set.  One  option  to  the  practitioner  is  to  use  an  excessively  large 
number  of  middle  nodes  within  the  limits  defined  by  Equation  73.  However,  this  increases  the 
possibility  of  accepting  a  model  with  more  middle  nodes  than  necessary.  Initial  experimentation  is 
one  method  to  use  for  determining  an  appropriate  initialization  for  the  middle  nodes. 

When  computing  resources  are  available,  the  architecture  selection  algorithm  is  best  utilized 
by  performing  more  than  one  initialization  of  H.  Performing  multiple  architecture  selection  runs 
provides  the  practitioner  with  additional  insight  and  may  reduce  the  risk  of  potentially  accepting 
a  larger  than  necessary  network  structure.  The  downside  is  that  this  approach  can  also  complicate 
the  goal  of  automatic  architecture  selection,  since  multiple  architecture  selection  runs  may  produce 
multiple network structures. In general, when choosing from a variety of network structures which
all  have  acceptable  prediction  accuracies,  the  smallest  network  structure  is  preferred. 

5.3.4 Application of the Architecture Selection Algorithm. The architecture selection algorithm was applied to the API problem used in Chapter IV to determine whether a reduction in the network architecture could be justified. Belue and Bauer found eight middle nodes were sufficient for good neural network performance [6]. Using this prior knowledge, the number of middle nodes was initialized at eight. The networks are trained for a minimum of 500 epochs, and improvement in the test set SSE is monitored every 50 epochs to justify continuation of training. Once the test set SSE no longer improves, training is discontinued. The parameter initializations for Step 1 are:

• Number of middle nodes: $H = 8$.

• Significance level for statistical testing of reduced structure models: $\alpha = .05$.

• Number of neural network runs: $I = 10$.

• Number of network features: $M = 4$.

• Number of training set exemplars: $P_{tr} = 181$.

• Number of training-test set exemplars: $P_{ts} = 100$.

• Total number of exemplars: $P = 281$.

• Number of network weight parameters: $s = 49$.
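
For reference, the parameter count and degrees of freedom implied by these settings can be worked out from the definitions in Steps 1 and 4 of the algorithm. The second line assumes the first comparison is between the full model with H = 8 and the reduced model with H = 7 middle nodes.

```latex
\begin{align*}
s_F &= (M+1)H + (H+1) = (4+1)(8) + (8+1) = 49, & df_F &= P - s_F = 281 - 49 = 232,\\
s_R &= (4+1)(7) + (7+1) = 43, & df_R &= P - s_R = 281 - 43 = 238,\\
df_R - df_F &= 6 = M + 2.
\end{align*}
```

The difference of six degrees of freedom corresponds to the weights attached to one middle node: M + 1 = 5 incoming weights (including the bias) and one outgoing weight.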


Ten architecture selection experiments are performed with different initial random number seeds. Backpropagation follows a unique gradient descent path for every run. The path is unique for three reasons: (1) the random partitioning of the training and test sets, (2) the random initialization of the weight vector, and (3) the random presentation of the training vectors. Therefore, the result of each experiment will be different. The results are summarized for the ten experiments in Table 23.


Table 23. Summary of Ten API Problem Architecture Selection Experiments

Experiment   Final Number              Corresponding
Number       Middle Nodes    SSE       % Classification Error
    1             8          4.398          1.779 %
    2             8          5.630          1.779 %
    3             7          6.373          2.850 %
    4             7          6.245          2.847 %
    5             7          6.445          2.135 %
    6             7          5.018          1.423 %
    7             6          4.656          2.491 %
    8             6          4.365          2.491 %
    9             6          5.003          2.847 %
   10             6          4.758          1.779 %
These results demonstrate that backpropagation converges to a variety of local minima. Each SSE represents the smallest of the local minima from 10 different runs. The ‘local minimum’ phenomenon means the algorithm will not always converge to the most parsimonious architecture. However, these results seem to indicate the algorithm is conservative. Therefore, if the initial number of middle nodes is sufficient, the final number of middle nodes will also be sufficient, whether or not a reduction has taken place.

All of the selected architectures, whether or not they are reduced, are associated with good SSEs and classification errors. On average, there is not a great difference between the minimum SSEs or classification error rates of the various architectures as reported in Table 24. For the API problem, eight middle nodes are sufficient [6], which corresponds to the results of experiments one and two. In eight out of ten experiments, the selected architecture contains fewer middle nodes. To summarize, the algorithm selected a sufficient initial architecture for each experiment, and the selected architecture was more parsimonious 80% of the time.

Table 24. API Problem: Average Performance of Selected Architectures

Final Number     Average          Corresponding
Middle Nodes     Minimum SSE      % Classification Error
     8             5.014               1.779 %
     7             6.020               2.313 %
     6             4.696               2.402 %
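
The averages in Table 24 can be reproduced from the rows of Table 23, as the short check below illustrates. The variable and function names are illustrative; the data values are copied from Table 23 as printed, and the results agree with Table 24 to within rounding.

```python
# (experiment, final middle nodes, minimum SSE, % classification error) from Table 23
TABLE_23 = [
    (1, 8, 4.398, 1.779), (2, 8, 5.630, 1.779),
    (3, 7, 6.373, 2.850), (4, 7, 6.245, 2.847),
    (5, 7, 6.445, 2.135), (6, 7, 5.018, 1.423),
    (7, 6, 4.656, 2.491), (8, 6, 4.365, 2.491),
    (9, 6, 5.003, 2.847), (10, 6, 4.758, 1.779),
]


def average_by_nodes(rows):
    """Group rows by final number of middle nodes and average SSE and error."""
    groups = {}
    for _, nodes, sse, err in rows:
        groups.setdefault(nodes, []).append((sse, err))
    return {
        nodes: (sum(s for s, _ in vals) / len(vals),
                sum(e for _, e in vals) / len(vals))
        for nodes, vals in groups.items()
    }


averages = average_by_nodes(TABLE_23)

# Fraction of experiments that ended with fewer than the initial eight middle nodes.
reduced_fraction = sum(1 for _, nodes, _, _ in TABLE_23 if nodes < 8) / len(TABLE_23)
```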

The tenth experiment, which selected an architecture of six middle nodes, is summarized in Table 25. Figure 17 shows the relationship between the number of model parameters and SSE for each number of middle nodes tested. Similarly, Figure 18 shows the relationship between the number of model parameters and the total classification error for each number of middle nodes tested. The variation in local minima for all 10 runs is displayed above the minimum SSE run for each of the tested architectures. Although the selected architecture contained six middle nodes, the results for five middle nodes are shown for comparison. More detailed results of this experiment are shown in an audit log contained in Appendix B, Section B.1.

Table 25. Summary of API Problem Architecture Selection (Experiment 10)

Figure 17. API Problem Architecture Selection: SSE for 10 Runs (Experiment 10)

Figure 18. API Problem Architecture Selection: % Classification Error for 10 Runs (Experiment 10)

5.3.5 Feature Selection Algorithm. In this section, a sequential feature selection algorithm is proposed for examining candidate feature subsets for systematic elimination from a feature set. The algorithm can be implemented for feature sets which are larger than necessary and which may contain irrelevant or redundant features. Alternatively, the algorithm can also be used to sequentially reduce a feature set which is larger than desired.

The algorithm uses the concepts of statistical model selection described in Section 2. Specifically, a backwards sequential selection procedure is used to systematically investigate reduced neural network models using likelihood ratio test statistics for statistically testing the reduced models. The feature selection algorithm is unique because initial architecture selection and architecture adjustment are embedded within the algorithm. Architecture adjustment allows the neural network to dynamically adapt its architecture as necessary throughout the feature selection process. Figure 19 depicts the neural network feature selection algorithm. A description of the algorithm follows.

Figure 19. Neural Network Feature Selection Algorithm

Neural Network Feature Selection Algorithm

1. Initialize and define parameters.

• Number of features being evaluated: $M$.
• Current number of features being evaluated: $k$. Initialize $k = M$.
• Current number of middle nodes: $H$.
• Number of neural networks: $I$.
• Number of training exemplars: $P_{tr}$.
• Number of training-test exemplars: $P_{ts}$.
• Total number of exemplars: $P = P_{tr} + P_{ts}$.
• Number of neural network weight parameters: $s = (k + 1)H + (H + 1)$.
• Significance level for feature elimination: $\alpha_1$.
• Significance level for middle node elimination: $\alpha_2$.
• Set $f = 0$. If desired, $f$ can also be set equal to the final number of features to be selected.
• Count = 0.

2. Train $I$ realizations of the full neural network model with $k$ features and $H$ middle nodes.

• Compute SSE using Equation 67.
• Define $\mathrm{SSE}_{F_i}$ = SSE for the $i$-th neural network realization.
• Define $\mathrm{SSE}_F = \min\{\mathrm{SSE}_{F_1}, \ldots, \mathrm{SSE}_{F_I}\}$.

3. Train $I$ realizations of the reduced neural network model with $k$ features and $H - 1$ middle nodes.

• Compute SSE using Equation 67.
• Define $\mathrm{SSE}_{R_i}$ = SSE for the $i$-th neural network realization.
• Define $\mathrm{SSE}_R = \min\{\mathrm{SSE}_{R_1}, \ldots, \mathrm{SSE}_{R_I}\}$.

4. Compute degrees of freedom for the full and reduced models.

• $df_F = P - s$ (full model).
• $df_R = P - [s - (k + 2)]$ (reduced model).

5. Compute the likelihood ratio test statistic $L$:

$$L = \frac{[\mathrm{SSE}_R - \mathrm{SSE}_F]/(df_R - df_F)}{\mathrm{SSE}_F/df_F}$$

6. Test the null hypothesis that the reduced model is equivalent to the full model using the likelihood ratio test statistic.

• If $L < F_{\alpha_2,\,df_R - df_F,\,df_F}$, the reduced structure model cannot be rejected.
  - Accept the reduced structure model (i.e., remove 1 middle node).
  - $H = H - 1$
  - $s = s - [k + 2]$
  - $\mathrm{SSE}_F = \mathrm{SSE}_R$
  - $df_F = df_R$
  - If Count = 1, then go to Step 2 and investigate a reduced architecture model before investigating feature reduction. If Count > 1, then go to Step 7.

• If $L \ge F_{\alpha_2,\,df_R - df_F,\,df_F}$, then reject the reduced structure model.
  - $\mathrm{SSE}_F = \mathrm{SSE}_F$
  - $df_F = df_F$
  - Go to Step 7.

7. If $k = 1$ or $k \le f$, then go to Step 13 since no further feature reduction is required.

8.  Train  I  realizations  for  each  of  the  k  reduced  feature  candidate  models.  Each  candidate  modd 
has  H  middle  nodes  and  a  different  feature  set  containing  ib  —  1  of  the  k  remaining  features 
which  have  not  been  previously  eliminated. 

•  Compute  SSE  using  Equation  67. 

•  Define  SSEa.^.  =  SSE  for  the  tth  neural  network  realization  of  the  jth  reduced  feature 
modd,  where  j  =  1,  •  •  • ,  ib. 

•  Define  SSE^  =  min{SSEii„ ,  •  •  • ,  SSEr^,  •  •  • ,  SSEji,^ ,  •  •  • , SSE«,„} 

•  Sdect  as  a  candidate  feature  for  elimination,  the  feature  which  when  removed  produces 
the  model  with  the  smallest  SSEji. 

9.  Compute  degrees  of  freedom  for  the  reduced  modd. 

•  ^R  =  P  -[»-  H]. 

10.  Compute  the  Likelihood  Ratio  Test  Statistic:  L. 

• $L = \dfrac{[\mathrm{SSE}_R - \mathrm{SSE}_F]/(\mathrm{df}_R - \mathrm{df}_F)}{\mathrm{SSE}_F/\mathrm{df}_F}$

11. If L < 0 or 0 < f < k, accept the reduced feature model.

• Eliminate the candidate feature.

• k = k - 1

• s = s - H

• Count = Count + 1


•  Go  to  Step  2. 

12.  Test  the  null  hypothesis  that  the  reduced  model  is  equivalent  to  the  full  model  using  the 
likelihood  ratio  test  statistic. 

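• If $L < F_{\alpha,\,\mathrm{df}_R-\mathrm{df}_F,\,\mathrm{df}_F}$, where α is the significance level for feature elimination, the
reduced model cannot be rejected.

- Accept the reduced feature model.

- Eliminate the candidate feature.

- k = k - 1

- s = s - H

- Count = Count + 1

- Go to Step 2.

• If $L > F_{\alpha,\,\mathrm{df}_R-\mathrm{df}_F,\,\mathrm{df}_F}$ and f = 0, then reject the reduced feature model.

- Go to Step 13.

13. Stop; the final neural network model has been determined with H nodes and k features.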
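
The degrees-of-freedom bookkeeping and the test statistic used in the steps above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in this research: the function names are hypothetical, and the weight count assumes a single hidden layer with one bias weight per middle node and a single output node with its own bias (an assumption that reproduces s = 89 for nine features and eight middle nodes).

```python
def weight_count(k, H, outputs=1):
    """Number of weights s for k features and H middle nodes (assumed layout)."""
    return (k + 1) * H + (H + 1) * outputs


def likelihood_ratio(sse_r, sse_f, df_r, df_f):
    """L = [(SSE_R - SSE_F) / (df_R - df_F)] / [SSE_F / df_F]."""
    return ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)


def node_removal_dfs(P, k, H):
    """(df_F, df_R) when one middle node is removed: k + 2 fewer weights."""
    s = weight_count(k, H)
    return P - s, P - (s - (k + 2))


def feature_removal_dfs(P, k, H):
    """(df_F, df_R) when one feature is removed: H fewer weights."""
    s = weight_count(k, H)
    return P - s, P - (s - H)
```

Under the single-output assumption, removing a middle node deletes its k input weights, its bias, and its one output weight, which is where the k + 2 term comes from.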

In  Step  1,  parameters  are  initialized  and  defined.  A  reasonable  number  of  neural  networks 
is  usually  five  or  ten.  When  trying  to  find  good  or  improved  local  minima,  there  is  a  point  of 
diminishing  returns  for  the  computational  cost  of  running  additional  networks. 

In  Steps  2  and  3,  each  training  realization  involves  a  different  training  and  training-test  set 
which  is  randomly  selected  from  among  the  total  pool  of  exemplars.  Randomly  selected  training 
and  training-test  sets  means  that  each  neural  network  run  is  trained  using  a  unique  partitioning  of 
the data.

To fairly compare results, the SSE is computed in Steps 2 and 3 using all P exemplars, so
the full and reduced models are compared using the same data set. If there is sufficient data, a
validation set SSE can be used to provide the most unbiased comparison between models.
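
As an illustration of these two points, the sketch below draws a fresh random training/training-test partition for each realization and then scores a trained model on all P exemplars. The helper names and the scalar-output form of the SSE are assumptions made for the example; Equation 67 remains the governing definition.

```python
import random


def random_partition(n_exemplars, n_train, seed):
    """Return (training, training-test) index lists for one realization."""
    rng = random.Random(seed)
    indices = list(range(n_exemplars))
    rng.shuffle(indices)
    return indices[:n_train], indices[n_train:]


def total_sse(predict, inputs, targets):
    """SSE over all exemplars, so every model is scored on the same data."""
    return sum((predict(x) - t) ** 2 for x, t in zip(inputs, targets))
```

Using a different seed for each of the I realizations gives each run its own partition, while `total_sse` is always evaluated on the full pool.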

A  finite  number  of  full  and  reduced  model  networks  are  trained.  From  this,  the  networks 
corresponding  to  the  best  full  and  reduced  model  local  minima  are  used  to  form  the  likelihood 
ratio  in  Step  5.  In  this  scenario,  where  only  a  finite  number  of  networks  are  trained,  it  is  possible 
that the reduced model may produce a lower total SSE than the full model. If the likelihood ratio test
statistic is negative in Step 6, the reduced model should be accepted, since the best reduced model
from I runs is better than the best full model from I runs.
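
A small sketch of this comparison, under the assumption that the F critical value has already been looked up and is passed in as `f_critical` (a hypothetical parameter name):

```python
def compare_best_runs(full_sses, reduced_sses, df_f, df_r, f_critical):
    """Form L from the best full and best reduced runs and test it."""
    sse_f = min(full_sses)       # best local minimum among the full-model runs
    sse_r = min(reduced_sses)    # best local minimum among the reduced-model runs
    L = ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)
    # A negative L is always below a positive critical value, so a reduced
    # model that beats the full model is accepted automatically.
    return L, L < f_critical
```

Because only a finite number of runs is compared, the sign of L reflects the runs that happened to be drawn rather than a property of the two model classes.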

With  any  type  of  network  redundancies  (middle  nodes  or  features),  there  are  flats  in  the 
parameter space [80:106]. These flats can make convergence to a good local minimum more
difficult; consequently, identification of an unnecessary or redundant feature may be more difficult.
Therefore, initial architecture selection, as well as architecture adjustment, is incorporated into
the algorithm.

Because architecture selection is incorporated into the algorithm, middle node initialization
is important (see Section 5.3.3). Architecture adjustment is investigated because fewer input
features may be associated with reduced network complexity. The algorithm attempts to maintain
a minimal network structure at each step, thereby allowing the focus to remain on the elimination
of unnecessary features.

The feature selection algorithm is set up with two potential stopping criteria. The first
stopping criterion is the most classical and is depicted in Figure 19. Here, the algorithm stops when
the likelihood ratio test statistic L is greater than the selected critical point of the F-distribution.
If this stopping criterion is used, the parameter f is set equal to zero.

The second stopping criterion involves stopping the algorithm after a predetermined number
of features have been eliminated. In this case, the algorithm is implemented with the parameter
f set equal to the final number of features to be selected. When this stopping criterion is used,
the candidate feature for elimination is removed at each step as long as the current number of
features is greater than the final number of features desired, even when the likelihood ratio test
statistic L exceeds the critical point of the F-distribution. When the second stopping criterion is
used, the following information should be reported at each step: SSE, L, P-values, classification
error (for a classification problem), current features, and current number of middle nodes. With
this information, a neural network practitioner can determine the most appropriate feature set by
analyzing the tradeoffs between an accurate model and a parsimonious model.
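
The two stopping criteria can be summarized as a single decision function. The sketch below is one reading of Steps 7, 11, and 12, not a definitive statement of them: the function name is hypothetical, f = 0 selects the first criterion, and Step 7 is interpreted as halting once the number of features k has dropped to f.

```python
def eliminate_candidate(L, f_critical, k, f=0):
    """Return True if the candidate feature should be removed."""
    if k == 1 or k <= f:
        return False              # Step 7: no further reduction required
    if L < 0 or 0 < f < k:
        return True               # Step 11: reduced model wins, or f not yet reached
    return L < f_critical         # Step 12: classical F-test criterion
```

With f = 0 the function reduces to the F-test alone; with f > 0 the F-test is bypassed until k reaches f, which is why the per-step diagnostics listed above should be reported.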

It is best to do some screening and analysis prior to using the feature selection algorithm.
Ideally, the noise-like features have already been identified and removed using the feature
screening techniques of Chapter IV.

5.3.6 Application of the Feature Selection Algorithm. In this section, the feature selection
algorithm is applied to the FLIR problem introduced in Chapter III. For the FLIR results reported
in Chapters III and IV, four middle nodes are used. Here, the FLIR problem is initialized with
eight middle nodes to demonstrate the feature selection algorithm’s effectiveness.

The results shown in Figure 10 of Chapter III, page 101, indicate that one to three features can
be eliminated without a significant degradation in validation set results. Using the second stopping
criterion, the feature selection algorithm is applied to select six good features with an appropriate
network architecture. Five neural networks are used in this experiment. Each network is trained
for a minimum of 500 epochs, and improvement in the test set SSE is monitored every 50 epochs to
justify continuation of training. Once the test set SSE no longer improves, training is discontinued.
The parameter initializations for Step 1 are:

• Number of middle nodes: H = 8.

• Significance level for statistical testing of reduced structure models: $\alpha_1$ = .05.

• Significance level for statistical testing of reduced feature models: $\alpha_2$ = .05.

• Number of neural network runs: I = 5.

• Number of network features: M = 9.

• Number of training set exemplars: $P_{tr}$ = 350.

• Number of training-test set exemplars: $P_{ts}$ = 200.

• Total number of exemplars: P = 550.

• Number of network weight parameters: s = 89.

• Final number of features to be selected: f = 6.
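
The training schedule described before the parameter list (a 500-epoch minimum, with the test set SSE checked every 50 epochs) can be sketched as follows. The callables `train_one_epoch` and `test_sse` are hypothetical stand-ins for the network code, the `max_epochs` safety cap is an addition, and comparing each check against the best SSE seen so far is an assumption about how "improvement" is judged.

```python
def train_with_monitoring(train_one_epoch, test_sse, min_epochs=500,
                          check_every=50, max_epochs=100_000):
    """Train until the monitored test set SSE stops improving."""
    best = float("inf")
    epoch = 0
    while epoch < max_epochs:
        train_one_epoch()
        epoch += 1
        if epoch % check_every == 0:
            current = test_sse()
            improved = current < best
            best = min(best, current)
            # Stop only after the minimum number of epochs has been reached.
            if epoch >= min_epochs and not improved:
                break
    return epoch, best
```

The function returns the epoch at which training stopped and the best monitored SSE.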

The results from applying the feature selection algorithm to the FLIR problem are summarized
in Table 26. For the reduced models accepted within each iteration of the algorithm, Figure 20
shows the relationship between the number of model parameters and minimum total SSE.
Similarly, Figure 21 shows the relationship between the number of model parameters and the total
classification error for each of the accepted models.

Both the first and second stopping rules are indicated in Figures 20 and 21. Using the first
stopping rule, only statistically equivalent reduced models are selected. The indicated stopping
point corresponds to a 43% reduction in parameters with no degradation in accuracy. The second
stopping rule allows the practitioner the flexibility to trade off accuracy for a more parsimonious
model. In this case, the indicated stopping point corresponds to a 63% reduction in parameters
with only a small degradation in accuracy. Notice that using the second stopping rule uncovers
three additional models for consideration. The best of these additional models occurs with four
middle nodes and seven features. More detailed results of this experiment are shown in an audit
log contained in Appendix B, Section B.2.

Since saliency metric results in Figure 10 indicate that one to three features can be eliminated,
the six most salient features would be a logical alternative for selecting six good features. However,
eliminating the three least salient features does not take into account that a seemingly unimportant
feature may be interacting in an important way with the remaining features. In contrast, the
sequential feature selection algorithm described in this section partially takes the feature
interactions into account when selecting a feature subset.

Figure 20. FLIR Problem Feature Selection: Minimum SSEs

Figure 21. FLIR Problem Feature Selection Experiment: % Classification Error for Minimum
SSE Networks

Table 26. Summary of a Feature Selection Experiment for the FLIR Problem

Indeed, the feature selection algorithm does not select the three least important features for
elimination. Out of the nine FLIR features, the 3rd, 6th, and 8th ranked salient features
in Table 12 on page 98 are eliminated using the feature selection procedure. Figures 22 and 23 can
be studied to compare the error rates over 800 epochs of training for the ‘selected feature subset’
versus the ‘salient feature subset.’ The ‘salient feature subset’ contains the six most salient features
according to the saliency metric.

In Figures 22 and 23, the minimum SSE and the corresponding percent classification error
from 10 runs are plotted for the P vectors used in the feature selection experiment. The results
shown for the minimum SSE network tell the same story as the average results over the 10 runs.
The average SSE for the selected features versus the most salient features was 24.58 and 30.36,
respectively, at 800 epochs. Similarly, the average percent classification error for the selected features
versus the most salient features was 5.23 and 6.96, respectively, at 800 epochs. The average percent
classification error is significantly better (at the 0.05 statistical significance level) for the subset of
selected features. This illustrates a quantifiable advantage to using the selection procedure on the
FLIR problem.

Figure 22. FLIR Problem Feature Subsets: Minimum SSE (vertical axis: SSE, 0 to 100)

5.4  Summary 

Since  a  neural  network  can  be  viewed  as  a  nonlinear  regression  model,  nonlinear  regression 
model selection concepts are applicable for neural networks. This chapter reviews nonlinear
regression model selection and the practical considerations encountered in a neural network model
selection  scenario.  Nonlinear  regression  model  selection  is  the  basis  for  proposing  architecture  and 
feature  selection  algorithms  for  neural  networks.  Both  algorithms  use  the  likelihood  ratio  test 
statistic  (shown  in  Equation  70  on  page  143)  within  a  backwards  sequential  selection  procedure. 

The  first  algorithm  is  an  architecture  selection  algorithm  for  determining  a  good  number 
of  middle  nodes.  The  algorithm  is  used  to  automate  what  is  often  a  process  of  trial  and  error 
experimentation  for  many  practitioners.  Application  results  from  ten  initial  architecture  selection 
experiments  using  the  API  problem  data  are  presented.  A  reduced  architecture  was  determined  for 
this  problem  for  eight  of  the  ten  experiments. 

The second algorithm is a feature selection algorithm for statistically investigating reduced
feature subsets. Embedded in the algorithm are initial architecture selection and architecture
adjustment. The network architecture is dynamically adapted as necessary throughout the feature
selection process. Application results from a feature selection experiment on the FLIR problem
demonstrate a 43% and a 63% reduction in parameters using two different stopping rules.
The 43% parameter reduction produced approximately the same prediction accuracy. This
reduction corresponded to accepting only statistically equivalent reduced models using the likelihood
ratio test statistic. The 63% parameter reduction produced a slightly degraded prediction accuracy,
and it corresponded to stopping the algorithm after three features had been eliminated, whether
or not the reduced model was statistically equivalent. When the six most salient features are
compared to the ‘selected feature subset,’ the prediction accuracy from the ‘salient feature subset’ is
significantly lower than that of the ‘selected feature subset’ at the .05 statistical significance level. This
result is possible because the feature selection algorithm partly takes the correlation structure of
the features into account.

In  the  next  chapter,  a  comprehensive  neural  network  selection  methodology  is  developed  for 
identifying  both  a  good  feature  set  and  an  appropriate  neural  network  architecture  for  a  specific 
situation.  The  methodology  combines  both  the  statistical  screening  procedure  from  Chapter  IV  and 
the  statistical  architecture  and  feature  selection  algorithms  into  a  comprehensive  statistically-based 
approach for neural network model selection.

VI. A Comprehensive Neural Network Selection Methodology


6.1  Introduction 

In Chapter IV, a statistical screening procedure is developed for the identification of noise-like
irrelevant features using the saliency metrics developed in Chapter III. In Chapter V, neural
network selection algorithms for architecture selection and feature selection are proposed. In this
chapter, a comprehensive neural network selection methodology is presented which logically
integrates the screening and selection algorithms into an overall statistically-based approach for neural
network model selection. As in the previous chapters, application of the proposed methodology
is investigated using a single hidden layer architecture with sigmoidal activation functions on the
middle and output layers. However, this methodology is general and is, therefore, appropriate for
any feedforward neural network architecture using a variety of activation functions. Changes in
architecture or activation functions would merely change the definitions of the neural network error
function which is minimized. This in turn changes the definitions of the saliency functions described
in Chapter III.

6.2 Comprehensive Neural Network Selection Methodology

6.2.1  Introduction.  The  methodology  proposed  in  this  section  is  the  culmination  of  this 
dissertation  effort.  The  proposed  methodology  represents  the  logical  combination  of  research  results 
from Chapters III, IV, and V into an overall procedure for dynamically identifying both a good
feature  set  and  an  appropriate  neural  network  structure  for  a  specific  situation. 

6.2.2 Comprehensive Selection Methodology. The comprehensive selection methodology
depicted in Figure 24 is composed of three separate modules within the algorithm: (1) initial
architecture selection, (2) saliency screening, and (3) feature selection.

The initial architecture module is described in Section 3 of Chapter V. The saliency screening
module is described in Section 2 of Chapter IV. The feature selection module is described in
Section 3 of Chapter V. A description of the comprehensive neural network selection methodology
depicted in Figure 24 follows.

Comprehensive  Neural  Network  Selection  Methodology 

1.  Initialize  and  define  parameters  for: 

•  Initial  architecture  selection  algorithm. 

•  Saliency  screening  procedure. 

•  Feature  selection  algorithm. 

2.  Perform  initial  architecture  selection  algorithm. 

3. Perform saliency screening and remove any noise-like features which are identified.

4.  Perform  feature  selection  algorithm. 

When  no  additional  features  can  be  removed  based  on  the  stopping  rules,  continue  to  Step  5. 

5.  Stop,  a  final  neural  network  has  been  selected  for  the  problem  at  hand. 

Compute  the  associated  feature  saliencies. 

The comprehensive selection methodology calls for finding an appropriate initial neural network
architecture, and then screening and eliminating any noise-like features. Next, the methodology
calls for iteratively investigating the elimination of unnecessary or redundant middle nodes and
features. When no further features can be removed, the saliencies are calculated for the remaining
features using the final network architecture.
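
The flow of the five steps can be sketched as a composition of the three modules. The sketch is schematic: each module is passed in as a callable, and the names and signatures are hypothetical placeholders rather than interfaces defined in this research.

```python
def comprehensive_selection(features, hidden_nodes, select_architecture,
                            screen_noise, select_features, saliency):
    """Run the three modules in order and report the final saliencies."""
    # Step 2: initial architecture selection.
    hidden_nodes = select_architecture(features, hidden_nodes)
    # Step 3: saliency screening; drop the features flagged as noise-like.
    noisy = screen_noise(features, hidden_nodes)
    features = [name for name in features if name not in noisy]
    # Step 4: feature selection with embedded architecture adjustment.
    features, hidden_nodes = select_features(features, hidden_nodes)
    # Step 5: saliencies of the remaining features on the final network.
    return features, hidden_nodes, saliency(features, hidden_nodes)
```

Keeping the modules as separate callables mirrors the modular structure of Figure 24 and lets each one be tested or replaced independently.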

With  any  type  of  network  redundancies  (middle  nodes  or  features),  there  are  flats  in  the 
parameter space [80:106]. These flats can make convergence to a good local minimum more
difficult; consequently, identification of an unnecessary or redundant feature may be more difficult.
Therefore, as unnecessary features are removed, the network structure is dynamically reduced as
necessary. The algorithm attempts to maintain a minimal network structure at each step, thereby
allowing the algorithm to focus on the elimination of unnecessary features.

6.2.3 Scope of Application. Two issues are discussed in this section. First, the scope of the
algorithm’s application is discussed. Second, the prospects for parallel processing are discussed.

Due to the multi-module/multi-iteration characteristics of the proposed network selection
algorithm, it is most efficient and prudent for problems where the total number of candidate features
is manageable, at most 20 or 30 features. This means the practitioner may need to
apply screening, intuition, or feature rotations to reach a reasonable number of features before
using the selection methodology.

The multi-iteration characteristic of the proposed network selection algorithm makes the
algorithm amenable to parallel computation. For the architecture and feature modules, multiple
iterations are used to ensure that a good local minimum is found. All of these iterations could be
parceled out and trained on separate networks; subsequently, the results from the trained networks
could be centrally analyzed. In the saliency screening module, multiple iterations are used so
saliency statistics can be collected for statistical hypothesis testing. Similar to the architecture
and feature modules, the iterations could be parceled out and trained on separate networks, and the
statistics could be centrally collected and analyzed.
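
A minimal sketch of this idea uses a pool of workers to train the realizations concurrently and then collects the SSEs centrally. The function `train_realization` is a hypothetical callable that trains one network for a given seed and returns its SSE; a thread pool is used here only to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def best_of_parallel_runs(train_realization, seeds, max_workers=None):
    """Train one realization per seed concurrently; return (min SSE, seed)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        sses = list(pool.map(train_realization, seeds))
    # Central analysis step: keep the run with the smallest SSE.
    return min(zip(sses, seeds))
```

For CPU-bound training in Python, `ProcessPoolExecutor` from the same module offers the same `map` interface and would be the more natural choice.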

6.2.4 Application of the Comprehensive Selection Algorithm. To test the comprehensive
selection algorithm, the XOR problem was modified to include the following features: x, y, $n_1$, $n_2$,
and $m_1$. The true features shown in Figure 7, page 90, are x and y; $n_1$ and $n_2$ are augmented noise
features; and $m_1$ is an augmented feature which is highly correlated with x. The feature $m_1$ is
defined as

$m_1 = x + 0.01\,\mathrm{UNF}(0,1)$


The modified XOR problem was designed to demonstrate that the saliency screening could be
used to identify and eliminate one or both of the noise features, and that the feature module could
be used to identify and eliminate either x or $m_1$, since the presence of both features constitutes
redundant feature information.
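
To make the construction concrete, the sketch below generates exemplars with the five features. Only the definition of the correlated feature, x plus 0.01 times a uniform (0,1) draw, is taken from the text; drawing x, y, and the noise features uniformly on (0,1) and labeling the classes by thresholding at 0.5 are assumptions made for illustration, since the layout of Figure 7 is not reproduced here.

```python
import random


def modified_xor(n, seed=0):
    """Generate n exemplars as ((x, y, n1, n2, m1), label) pairs."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x, y = rng.random(), rng.random()        # true features (assumed uniform)
        n1, n2 = rng.random(), rng.random()      # augmented noise features
        m1 = x + 0.01 * rng.random()             # m1 = x + 0.01 UNF(0,1)
        label = int((x > 0.5) != (y > 0.5))      # assumed XOR class rule
        rows.append(((x, y, n1, n2, m1), label))
    return rows
```

Because the added term is at most 0.01, the redundant feature tracks x almost exactly, which is what makes either of the two removable.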

The selection experiment is performed with five network runs for the architecture and feature
modules, and ten network runs for the saliency module. In the feature selection module, the
first stopping criterion is used. This means the algorithm stops when no additional features can be
removed using the likelihood ratio test statistics. The networks are trained for a minimum of 500
epochs, and improvement in the test set SSE is monitored every 50 epochs to justify continuation
of training. Once the test set SSE no longer improves, training is discontinued. The parameter
initializations for Step 1 are:

•  Number  of  middle  nodes:  H  =  6. 

• Significance level for statistical testing of reduced structure models: $\alpha_1$ = .05.

• Significance level for statistical testing of reduced feature models: $\alpha_2$ = .05.

• Individual significance level for saliency screening for noise-like features: $\alpha_3$ = .01 (family
significance level of 0.05).

• Number of architecture and feature module neural network runs: 5.

• Number of saliency module neural network runs: 10.

• Number of network features: M = 5.

• Number of training set exemplars: $P_{tr}$ = 300.

• Number of training-test set exemplars: $P_{ts}$ = 200.

•  Total  number  of  exemplars:  P  =  500. 

• Number of network weight parameters: s = 43.

Table 27. Selection Experiment with the Modified XOR Problem

Table 27 summarizes the experiment. The saliency screening procedure is separately summarized
in Table 28, since the format of Table 27 is not adequate. The initial architecture selection
module reduces the network architecture from six middle nodes to five middle nodes. The saliency
screening module identifies both $n_1$ and $n_2$ as noise-like, which reduces the network model to five
middle nodes and three features. In the feature module, the architecture is reduced to four middle
nodes, and the reduced feature model with y and $m_1$ is identified as equivalent to the full
feature model with x, y, and $m_1$. In summary, the comprehensive feature selection procedure is
used to sequentially reduce the neural network model from 43 parameters to 17 parameters using
likelihood ratio test statistics to sequentially identify reduced models which are approximately
statistically equivalent. This represents a 60% reduction in weight parameters, with no degradation
in prediction or classification accuracy.
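
The parameter counts quoted above can be checked by hand. The check assumes one bias weight per middle node and a single output node with its own bias, and it takes the final model to have four middle nodes and two features; these assumptions are inferred from the reported totals rather than stated explicitly in this section.

```latex
\begin{align*}
s(k, H) &= (k + 1)H + (H + 1) \\
s(5, 6) &= 36 + 7 = 43 && \text{(initial model)} \\
s(5, 5) &= 30 + 6 = 36 && \text{(after initial architecture selection)} \\
s(3, 5) &= 20 + 6 = 26 && \text{(after saliency screening)} \\
s(2, 4) &= 12 + 5 = 17 && \text{(final model)} \\
\frac{43 - 17}{43} &\approx 0.605 && \text{(about a 60\% reduction)}
\end{align*}
```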

More  detailed  results  of  this  experiment  are  shown  in  an  audit  log  contained  in  Appendix  B, 
Section B.3.

Table 28. Saliency Screening Results on XOR Problem for 10 Runs

6.3 Summary

The comprehensive selection methodology proposed in this chapter represents the logical
integration of the statistical screening and selection procedures of Chapters IV and V, along with the
saliency metrics of Chapter III. The methodology is designed for identifying both a good feature set
and a potentially reduced neural network architecture for a specific situation. The scope of
application and the potential for parallel implementation are discussed. An application experiment on a
modified XOR problem demonstrates the utility of the methodology for identifying and eliminating
both noise-like and redundant features, while adjusting the neural network architecture as
necessary. In Chapter VII, the dissertation research is summarized and future research recommendations
are made.

VII. Summary and Recommendations


7.1  Introduction 

Feature  and  model  selection  for  feedforward  neural  networks  are  advanced  in  this  research. 
Feature  selection  involves  determining  a  good  feature  subset  from  a  set  of  candidate  features,  and 
the  process  of  feature  selection  is  characterized  by  three  components  in  this  research:  (1)  a  metric  or 
criterion  function  for  evaluating  and  ranking  the  features,  (2)  a  procedure  for  identifying  irrelevant 
and  redundant  features,  and  (3)  a  search  methodology  for  examining  candidate  feature  subsets. 
Model  selection  involves  determining  an  appropriate  architecture  (number  of  middle  nodes)  for  the 
neural network. In this chapter, contributions advancing the process of neural network feature and
model selection are summarized and recommendations for future research are made.

7.2  Summary 

This  research  begins  with  a  review  of  feature  selection  techniques  developed  for  regression, 
discriminant analysis, and neural networks. Because neural network applications include problems
which have often been solved using classical regression and discriminant analysis techniques,
non-neural feature selection techniques are reviewed for their potential use in a neural network context.
What  follows  is  a  summary  of  the  research  advances  and  contributions  made  in  the  areas  of  feature 
saliency,  identification  of  noisy  features,  and  neural  network  model  selection. 

7.2.1  Feature  Saliency  Metrics.  Feature  saliency  metrics  are  used  to  measure  the  relative 
importance  of  a  feature  with  respect  to  a  trained  neural  network.  This  research  consolidates  the  set 
of  available  neural  network  feature  saliency  metrics  by  developing  a  catalogue  of  feature  saliency 
metric  definitions  and  interrelationships. 

In this research, a framework is developed and used for analyzing a variety of derivative-based
metrics. Several of the derivative-based saliency metrics are evaluated for their sensitivities
to sampling, training, and redundant middle nodes. The metrics do not appear to be particularly
sensitive to sampling; however, they are sensitive to redundant middle nodes and the amount of
training. It is most important to eliminate redundant middle nodes, since the metrics are most
sensitive to training effects in the presence of redundant middle nodes. Increased training may cause
the weights associated with the redundant nodes to grow disproportionately. This can contaminate
saliency results, since the weights from irrelevant features can get large.

A  theoretical  relationship  is  shown  between  derivative-based  and  weight-based  saliency.  In 
summary,  the  derivative-based  feature  saliency  metrics  are  bounded  above  by  a  constant  linear 
combination  of  the  feature  weights.  At  one  middle  node,  the  ‘saliency  function’  ratios  produced 
with  the  weight-based  metrics  and  derivative-based  metrics  are  equal,  to  within  roundoff  error. 
When  additional  middle  nodes  are  used,  empirical  results  indicate  that  the  relative  saliency  of 
important to unimportant features is smaller with weight-based saliency than it is for
derivative-based saliency. For problems with redundant middle nodes, this is partly due to the growth of the
irrelevant  weights  associated  with  redundant  middle  nodes. 

Contributions  are  also  made  in  the  area  of  Bayesian-based  feature  saliency  metrics.  First, 
a  succinct  and  exact  relationship  is  demonstrated  between  a  previously  suggested  Bayesian-based 
metric  and  derivative-based  saliency.  Then  a  new  Bayesian-based  saliency  metric  using  the  partial 
derivative  of  classifier  error  is  introduced.  The  computation  of  this  metric  requires  only  a  subset  of 
the  terms  associated  with  the  previously  suggested  Bayesian-based  metric.  Finally,  the  relationship 
between  the  new  Bayesian-based  saliency  and  derivative-based  saliency  is  derived,  and  an  upper 
bound  for  the  Bayesian-based  saliency  is  defined.  For  a  two  class  problem,  the  new  metric  produces 
results  exactly  equivalent  to  derivative-based  saliency. 

The catalogue of metrics is evaluated for a ‘real world’ problem. On this problem, saliency
rankings, ‘saliency function’ ratios, and factor analysis are used to empirically evaluate similarities
and differences between the saliency metrics. One similarity is that, despite differences, all of
the metrics consistently ranked a set of ‘nonessential’ features last. Since the metrics perform
differently, recommendations are made for selecting a feature saliency metric.

For discriminant analysis problems using networks with more than one middle node, the new
saliency metric in Equation 53 on page 86 is preferable to the metric in Equation 31 on page 68 for
two reasons. First, it is intuitively appealing, since it represents a saliency metric which is related
to the average classifier $P_e$. Second, it is more succinctly defined using only a subset of the terms
required for computation of the latter.

For function approximation or discriminant analysis problems using networks with more than
one middle node, the proposed saliency metric $\Lambda_j^{data}$ in Equation 31 on page 68 should be preferred
over $\Lambda_j$. The saliency metric $\Lambda_j^{data}$ provides good feature rankings, is more succinctly defined than
$\Lambda_j$, and is based on information known about the data from feature space.

For any classification or function approximation problem using a network with only one middle
node, the weight-based metrics (see Equation 32 on page 78) are best. In this case, the relative
saliencies produced using weight-based saliency will be identical to $\Gamma_j$ or $\Lambda_j^{data}$. For networks using
more than one middle node, weight-based saliency can still be used for a cursory analysis of a
feature’s relative importance. However, the empirical results suggest that the relative importance
of one feature to another is degraded when additional middle nodes are used.

7.2.2 Identification of Noisy Features. A saliency screening procedure for identifying noisy
features is developed based on statistically comparing the mean saliency of candidate features to
the mean saliency of a noisy feature. This research extends the work of Belue and Bauer to jointly
screen an entire feature set for irrelevant features using a paired-t hypothesis test. Irrelevant features
are successfully identified over a series of test problems using the proposed saliency screening
procedure. The procedure is robust for identifying irrelevant features, even in the presence of input
redundancies.
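The screening idea can be illustrated with a minimal sketch. It assumes that each candidate feature and the injected noise feature have one saliency value per training run, and that a feature is flagged when a paired-t statistic on the run-by-run differences does not exceed a critical value taken from a t table. The function names, the one-sided decision rule, and the saliency values are hypothetical and are not taken from the procedure developed in this research.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(candidate, noise):
    """Paired-t statistic for the run-by-run differences candidate - noise."""
    if len(candidate) != len(noise) or len(candidate) < 2:
        raise ValueError("need two equal-length samples with at least 2 runs")
    diffs = [c - n for c, n in zip(candidate, noise)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("differences have zero variance")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


def screen_features(saliency_by_feature, noise_saliency, t_critical):
    """Return features whose mean saliency is not significantly above the noise feature's."""
    flagged = []
    for name, runs in saliency_by_feature.items():
        if paired_t_statistic(runs, noise_saliency) <= t_critical:
            flagged.append(name)
    return flagged


# Hypothetical saliencies from 5 training runs (illustrative values only).
noise = [0.10, 0.12, 0.09, 0.11, 0.10]
saliencies = {
    "x1": [0.90, 0.85, 0.95, 0.88, 0.92],  # clearly more salient than noise
    "x2": [0.11, 0.10, 0.10, 0.12, 0.09],  # indistinguishable from noise
}
T_CRIT = 2.132  # one-sided critical value, alpha = 0.05, 4 degrees of freedom

print(screen_features(saliencies, noise, T_CRIT))
```

Pairing by run removes run-to-run variation that is common to the candidate and the noise feature, which is why the differences, rather than the two raw samples, are tested.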

The saliency screening technique is compared to the irrelevant input hypothesis test proposed
by White [79], a weight screening procedure. The same test problems are used to evaluate the
weight screening procedure. On any single run, the weight screening results are not reliable. When
average test statistics from several runs are used, the weight screening procedure does provide
comparable results for two of the three problems.

7.2.3 Neural Network Model Selection. Two novel neural network selection algorithms are
developed  by  posing  the  neural  network  model  as  a  nonlinear  regression  statistical  model.  Both 
algorithms  are  developed  using  the  likelihood  ratio  test  statistic  (shown  in  Equation  70  on  page  143) 
within  a  backwards  sequential  selection  procedure.  The  first  algorithm  is  an  architecture  selection 
algorithm  which  automates  the  process  of  determining  an  appropriate  number  of  middle  nodes. 
The  second  is  a  feature  selection  algorithm  for  statistically  investigating  reduced  feature  subsets. 
The  feature  selection  algorithm  is  unique  because  architecture  reduction  is  investigated  as  features 
are removed. The feature selection algorithm indirectly takes the correlation structure of the
features  into  account.  For  this  reason,  a  better  reduced  feature  set  may  be  identified  using  the 
feature  selection  algorithm  versus  using  a  subset  of  the  most  salient  features.  Application  results 
demonstrate  how  these  algorithms  can  be  used  to  search  for  a  more  parsimonious  neural  network 
model  with  equivalent  prediction  accuracy. 
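Equation 70 is not reproduced in this section. As a minimal sketch, the comparison of a full and a reduced model can be illustrated with the standard extra-sum-of-squares F-type statistic from regression analysis. Treating this as the form of the likelihood ratio test statistic is an assumption, as are the function names, the degrees-of-freedom arguments, and the illustrative numbers.

```python
def likelihood_ratio_statistic(sse_full, sse_reduced, df_num, df_den):
    """F-type statistic comparing a reduced model against a full model.

    df_num: number of parameters removed from the full model (assumed).
    df_den: residual degrees of freedom of the full model (assumed).
    """
    if sse_full <= 0 or df_num <= 0 or df_den <= 0:
        raise ValueError("SSE and degrees of freedom must be positive")
    return ((sse_reduced - sse_full) / df_num) / (sse_full / df_den)


def accept_reduced(sse_full, sse_reduced, df_num, df_den, f_critical):
    """Accept the reduced model when the statistic stays below the critical value."""
    stat = likelihood_ratio_statistic(sse_full, sse_reduced, df_num, df_den)
    return stat < f_critical


# Illustrative values only: removing 6 parameters raises the SSE from 4.0 to 5.0.
F_CRIT = 2.14  # approximate table value for F(6, 240) at alpha = 0.05
stat = likelihood_ratio_statistic(4.0, 5.0, 6, 240)
print(round(stat, 3), accept_reduced(4.0, 5.0, 6, 240, F_CRIT))
```

When the reduced model happens to reach a lower SSE than the full model, the statistic is negative and the reduced model is accepted, which matches the way a backwards sequential procedure keeps shrinking the model until a reduction is rejected.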

A  comprehensive  neural  network  selection  methodology  is  developed  for  identifying  both  a 
good  feature  set  and  an  appropriate  neural  network  structure  for  a  specific  situation.  It  encompasses 
a  combination  of  statistical  screening  and  statistical  architecture  and  feature  selection.  Application 
results  demonstrate  how  the  comprehensive  methodology  can  be  used  for  identifying  and  eliminating 
both  noise-like  and  redundant  features,  as  well  as  reducing  the  number  of  middle  nodes  in  the  neural 
network  architecture. 

7.3 Recommendations


Several related research topics could not be adequately addressed within this research
effort. Four of them appear worth pursuing.

The first research topic is an extension of the model selection procedures in Chapters V and VI.
These single-output procedures need to be extended for a multi-output neural network. A covariance
adjusted sum of squared-errors must be computed for the multivariate response likelihood ratio test
statistic. This involves separately estimating the covariance matrix of the neural network responses.

The second research topic is the development of an automatic stopping point selection for
neural network training runs. The procedures proposed in Chapters IV, V, and VI require multiple
runs of trained neural networks. Normally, trained networks correspond to a minimum training-test
set error. However, a trained network is difficult to automatically identify within a multiple run
procedure for several reasons. First, a network may take fewer or more epochs, from run to run,
depending on the gradient descent path taken with the backpropagation algorithm. Second, there
is a great deal of variability in a network’s error during training. Third, there may be several epochs
where the test set error does not improve prior to eventual convergence. Fourth, neural networks
may converge to a variety of local minima, to saddle points, or even diverge, making it unrealistic
to expect all networks to reach a predetermined target value.

The model selection algorithms described in Chapter V depend on the network’s SSE, which
can be biased low or high depending on where the stopping point is selected. In this research, prior
knowledge was used to select a minimum number of training epochs, and then the network was
trained as long as periodic monitoring indicated a reduction in the test set error. This procedure
should be improved. A relatively simple yet accurate method for automatic stopping point selection
is needed for practical implementation within the multiple runs procedures proposed in this
dissertation.
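The stopping rule described above (train for a minimum number of epochs, then continue only while periodic monitoring shows a reduction in test set error) can be sketched as follows. The function name, the list-based interface, and the improvement factor `c` are assumptions made for illustration; the sketch is not the implementation used in this research.

```python
def select_stopping_epoch(test_sse, m, min_epochs, c=1.0):
    """Pick a stopping epoch from periodic test-set SSE evaluations.

    test_sse[k] is the test-set SSE after (k + 1) * m training epochs.
    Training is assumed to continue past an evaluation only while the next
    evaluation is below c times the current one (c = 1.0: any reduction).
    """
    if m <= 0 or not test_sse:
        raise ValueError("need a positive interval and at least one evaluation")
    # Index of the first evaluation at or beyond the minimum number of epochs.
    k = max(0, -(-min_epochs // m) - 1)
    if k >= len(test_sse):
        raise ValueError("not enough evaluations to cover min_epochs")
    while k + 1 < len(test_sse) and test_sse[k + 1] < c * test_sse[k]:
        k += 1
    return (k + 1) * m


# Illustrative values only: SSE evaluated every 50 epochs.
history = [9.0, 8.0, 7.5, 7.4, 7.6, 7.0]
print(select_stopping_epoch(history, m=50, min_epochs=100))
```

In this example the rule stops at the first evaluation that fails to improve, even though a later evaluation is lower. That behaviour illustrates the third difficulty listed above: a temporary plateau before eventual convergence can bias the selected SSE high.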

The third research topic is the development of a diagnostic tool for identifying neural network
redundancies. Throughout this research, feature space and middle node redundancies were an issue.
There is a great need for a statistical or non-statistical technique for reliably diagnosing the existence
of these redundancies in neural networks. The specification robust irrelevant input hypothesis test can
only be reliably used in the absence of these redundancies [80]. If these redundancies are difficult to
diagnose and remedy a priori, then procedures are needed which are robust to the possible presence
of these redundancies.

The  fourth  research  topic  is  the  development  of  residual  analysis  techniques  for  a  neural 
network  practitioner.  There  is  a  dearth  of  information  available  on  residual  analysis  of  neural 
network  models.  Residual  analysis  is  well  documented  in  a  regression  setting  for  examining  the 
model  aptness,  but  has  not  been  documented  in  a  neural  network  setting.  Although  neural  network 
residuals  are  not  generally  normally  distributed,  residual  analysis  techniques  should  be  developed 
which would be of practical benefit to a practitioner.


Appendix A. Derivation and Details of Two Saliency Metrics


A.1 Second Order Feature Evaluation Metric


Le Cun and others introduced a technique called Optimal Brain Damage for reducing the size
of a neural network by selectively deleting weights or units from a neural network based on their
saliencies [39]. The technique can be applied to feature evaluation when all the weight parameters
from a feature are evaluated together.


Le Cun and others approximate the error of a neural network using a Taylor series expansion,
where neural network error, $\varepsilon_o$, refers to a function of the squared error [39]. A perturbation of the
weight parameters connected to feature i will change the neural network error by

This saliency metric measures the impact to $\varepsilon_o$ when deleting the feature to middle node
weights for each feature. When a non-salient feature’s associated weights are deleted, the change
in $\varepsilon_o$ is small. When a highly salient feature’s associated weights are deleted, the change in $\varepsilon_o$ is
large. In order to make it computationally practical to evaluate the Taylor series expression for the
change in error, Le Cun and others make three simplifying assumptions [39]:


1.  They  assume  they  are  at  a  local  minimum  of  the  error  which  makes  the  first  order  terms 
equal  to  zero. 

2.  They  make  a  diagonalizing  assumption  to  eliminate  cross  terms. 

3.  They  make  a  quadratic  approximation  assumption  which  implies  that  3rd  order  and  higher 
order  terms  are  negligible. 

What remain are the second-order terms, which should be positive at a local minimum. This means
any change in error due to perturbation of a parameter will be an increase in error. The simplifying
assumptions reduce Equation 74 to


$$\delta\varepsilon_o = \frac{1}{2}\sum_{j} h_{jj}\,(\delta w^1_{ij})^2 \qquad (75)$$
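For reference, the reduction to the diagonal second-order terms can be traced on the generic Taylor expansion used in Optimal Brain Damage. The sketch below uses a single generic weight index $j$ and the symbols $g_j$ and $h_{jk}$ for the gradient and Hessian entries; this indexing is an assumption made for illustration and may differ from the notation of Equation 74.

```latex
\delta\varepsilon_o
  = \sum_{j} g_j\,\delta w_j
  + \frac{1}{2}\sum_{j} h_{jj}\,(\delta w_j)^2
  + \frac{1}{2}\sum_{j \neq k} h_{jk}\,\delta w_j\,\delta w_k
  + O\!\left(\lVert \delta \mathbf{w} \rVert^{3}\right),
\qquad
g_j = \frac{\partial \varepsilon_o}{\partial w_j},
\quad
h_{jk} = \frac{\partial^2 \varepsilon_o}{\partial w_j\,\partial w_k}
```

Assumption 1 removes the first sum, assumption 2 removes the cross terms, and assumption 3 drops the remainder, leaving only the diagonal second-order sum.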


The neural network expression for $\varepsilon_o$ is

To show the detailed neural network notation for Equation 75, the diagonal of the Hessian $h_{jj}$ must
be derived in detail:

Using the Levenberg and Marquardt approximation, the second term is assumed to be negligible
[39], [50:523]. Essentially, this assumption is good when the second term is zero (as in the
linear case), or when the second term is small or negligible compared to the term involving the
first derivative [50:523]. Also, the term multiplying the second derivative in the second term is
$[d_k - z_k]$, which should just be the random measurement error of each point for a trained network.
This error should fluctuate randomly around zero and should in general be uncorrelated with the
network model, so the second derivative terms would tend to cancel out when summed over the entire
data set [50:523]. When the second term is negligible, $h_{jj}$ is guaranteed to be positive, and can be
approximated as

Substituting $h_{jj}$ into Equation 75 gives


(76) 


where 


^2k  -2  f  1  2 

aw;  = 


and 61 and Sj are defined in Chapter 1, Section 3. Also, $\delta w^1_{ij}$ when the weight parameter is
eliminated is $-w^1_{ij}$. Making these substitutions, the detailed neural network notation for Equation 76
becomes

de„  =  I 

^  j=ik=i 

which is the definition of error corresponding to the saliency $s_i(\mathbf{x}^p)$ of feature i for the pth input
vector, i.e.

^ j=i k=i 

The second order saliency metric, $S_i$, is formed by averaging $s_i(\mathbf{x}^p)$ over all P data vectors:

$$S_i = \frac{1}{P}\sum_{p=1}^{P} s_i(\mathbf{x}^p) \qquad (77)$$

A.2 Relevance Feature Evaluation Metric

Moser and Smolensky propose a saliency technique for measuring the relevance of a neural
network feature. The relevance of a neural network feature i, $\rho_i$, is measured as a function of
how well the network performs with the feature versus how well the network performs without the
feature.

$$\rho_i = \varepsilon_{\text{without feature } i} - \varepsilon_{\text{with feature } i}$$


where neural network error $\varepsilon$ is defined as the mean absolute error over all P vectors:

$$\varepsilon = \frac{1}{P}\sum_{p=1}^{P}\sum_{k=1}^{K} \left| d_k^p - z_k(\mathbf{x}^p, \mathbf{w}) \right|$$
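The relevance definition above can be illustrated with a small sketch that evaluates the mean absolute error with and without one feature and takes the difference. Removing a feature is modeled here by multiplying it by zero, which is an assumption of this sketch; the function names and the toy network are hypothetical.

```python
def mean_absolute_error(predict, inputs, targets, mask=None):
    """Mean over all vectors of the summed absolute output errors."""
    total = 0.0
    for x, d in zip(inputs, targets):
        if mask is not None:
            # Scale each feature by its mask entry (0.0 removes the feature).
            x = [xi * a for xi, a in zip(x, mask)]
        z = predict(x)
        total += sum(abs(dk - zk) for dk, zk in zip(d, z))
    return total / len(inputs)


def relevance(predict, inputs, targets, i):
    """Error without feature i minus error with feature i."""
    n_features = len(inputs[0])
    mask = [0.0 if j == i else 1.0 for j in range(n_features)]
    without_i = mean_absolute_error(predict, inputs, targets, mask)
    with_i = mean_absolute_error(predict, inputs, targets)
    return without_i - with_i


def toy_predict(x):
    """Hypothetical one-output model that uses only the first feature."""
    return [2.0 * x[0] + 0.0 * x[1]]


inputs = [[1.0, 5.0], [2.0, -3.0], [3.0, 7.0]]
targets = [[2.0], [4.0], [6.0]]
print(relevance(toy_predict, inputs, targets, 0))
print(relevance(toy_predict, inputs, targets, 1))
```

The first feature carries all of the information in this toy model, so its relevance is large, while the second feature has zero relevance because removing it leaves the error unchanged.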

A relevance factor $\alpha_i$ is introduced which corresponds to the attentional strength of feature
i, $x_i$. Moser and Smolensky’s relevance factor $\alpha_i$ is associated with the feature unit $x_i$. Essentially,
Moser and Smolensky consider two discrete levels of relevance, $\alpha_i = 0$ and $\alpha_i = 1$, corresponding
to $\varepsilon_{\text{without feature } i}$ and $\varepsilon_{\text{with feature } i}$, respectively. The neural network output can be defined as:


Zk  =  f 


where $\alpha_i$ is associated with feature $x_i$ as follows


Moser and Smolensky propose approximating $\rho_i$ using the derivative of the error with respect
to $\alpha_i$:

and they assume this relationship holds approximately for $\gamma = 0$, giving

Using this definition of neural network error, the relevance of feature i, $\rho_i$, can be derived in
detailed neural network notation as follows

The relevance or saliency $\rho_i$ is essentially a weighted average of the terms which comprise
$\Lambda_i^{data}$, and is defined here for completeness.

If a relevance factor $\alpha$ is defined which is associated with the weight parameters connected
to a feature, rather than the feature itself, the resulting relevance of feature i would be:

Pi  =  P~^'E,m2 

p=ii=i»=i' 


which is very similar to $S_i$ defined in Equation 77 in Section A.1.


Appendix B. Model Selection Audit Runs

B.1 Application of the Architecture Selection on the API Problem

OVERALL NEURAL NETWORK SELECTION: AUDIT TRAIL


Number of ANN iterations to find good local min
in architecture and feature modules: 10

Number of ANN iterations for saliency screening module statistics: 10

Initial model structure is:

NUBiBaR  OP  INITIAL  MIDDLB  NODBS  • 

NUMBBR  OP  INITIAL  PBATURBS  4 
NUMBBR  OP  OUTPUTS  1 
1  OUTPUTS  USBD  POR  3  CLASSBS 
NUMBBR  OP  VBCTORSi  TRAIN  •  TBSTt  1«1  100 
NUMBBR  OP  INITIAL  TRAINDfa  BP  OCRS  BOO 

NUMBBR  OP  ADDITIONAL  TRAININO  BPOCHS  BBTWBBN  TBST  SBT  SSB  BVALUATIONS  (m)  BO 

IMPROVBMBNT  RBQinRBD  BBTWBBN  TBST  SBT  BVALUATIONS  POR  TRAINING  TO  CONTINUB  (1-C3)R  =  B.00000B-039( 
TRAINING  CONTINUBS  lP]tft.Mc(l]  ;  TS-SSB[l-m]*C3 
LOG  docUaiag  learalag  ralot  ated 
MOMBNTUM  STBP  S12B  C3  0.300000 

ONLY THE STATISTICALLY INDICATED FEATURES ARE REMOVED

STATISTICAL  SIGNIP  LBVBLSi  ARCR^BAT,SAL=  B.OOOOOOOOOOOOOD-Oa  B.OOOOOOOOOOOOOD-Oa 
8.0000000000000D‘03 
RANDOM  NUMBBR  SBBD  10.00000 

//////////////////////////////////////////// 

/////////////////////////////////////// 

“••••♦••'•■••INITIAL  ARCHITBCTURB  SBLBCTION  mODULB:****^^**^“*^*^“^****»* 

MODBL  SBLBCTION  ITBRATION  1 

CURRBNT  NUMBBR  MIDDLB  NODBS/PBATURBS  -  PULL  MODBL  «  4 
CURRBNT  PBATURB  SBLBCTION  VBCTOR 
1111 

SUMMARY  OP  10  RUNS 


1  #cp>=  550  PULL  NBRRs 

3.t4«M 

%  TOTAL  SSB= 

7.34663 

3  #cpi=  TOO  PULL  XBRR= 

3.01450 

%  TOTAL  SSB= 

7.40377 

3  #epi=  550  PULL  KBRR= 

4.05331 

%  TOTAL  SSB= 

0.01056 

4  #ep.s  500  PULL  KBRRs 

4.43633 

%  TOTAL  SSBs 

10.40360 

5  #cpi3  550  PULL  XBRR= 

3.14605 

K  TOTAL  SSB= 

7.00167 

5  #ep.=  550  PULL  NBRR= 

3.01460 

%  TOTAL  SSB= 

9.63053 

T  #<pis  550  PULL  XBRRs 

4.aro46 

K  TOTAL  SSBs 

7.55130 

5  #.p.s  550  PULL  KBRR= 

6.04053 

%  TOTAL  SSB= 

11.7356 

•  #<p.s  500  PULL  KBRRs 

6.60306 

%  TOTAL  SSB= 

13.6137 

10  #<p>s  550  PULL  KBIIR= 

3.55573  K  TOTAL  SSB= 

0.43373 

SUMMARY  OP  10  RUNS 

1  #<pis  550  RBDU  KBRRs 

4.63633 

%  TOTAL  SSB= 

11.0360 

3  #(p»  550  RBDU  XBRRs: 

6.60306 

%  TOTAL  SSB:. 

11.0657 

3  #«p*=  550  RBDU  KBRR= 

4.63633 

%  TOTAL  SSBs 

0.91530 

4  #«p.3  550  RBDU  KBRRs 

4.63633 

%  TOTAL  SSB= 

0.36451 
5  #.p.s

SOO  RBDU  KBRR» 

3.S4«0*  %  TOTAL  SSB:. 

7.17378 

•  #«p.o 

050  RBDU  NBRRa 

4.3T04*  %  TOTAL  SSBs 

10.8573 

T  #.p.a 

5M  RBDU  XBRRa 

4.aT04«  %  TOTAL  SSBs 

9.04195 

•  #.P.= 

600  RBDU  KBRR= 

4.3T04S  K  TOTAL  SSBs 

7.93945 

•  #.P.= 

•SO  RBDU  HBRRs 

•.•1450  %  TOTAL  SSBs 

10.03434 

10  #.pt= 

•to  RBDU  KBRRs 

4.aT04«  %  TOTAL  SSBs 

8.94319 

Cm>r«at  fcaUr«  stl«ct  vector  till 
Dcfteet  of  freedoat  •  233 

FttU  model  miaimem  TOTAL  SSB=  r.34B6333«Tr336 
Redaced  model  mlaimam  TOTAL  SSB=  T.tT3Td30dT4ITl 
LIKBLIHOOD  RATIO  TBST  STATISTIC  Ls  .O.0O4eO19dl61«ft« 

Accept  Redaced  Modeli  SSB'R  {  SSB'P  L  i  P' ALPHA 
RBDUCBD  MODBL  BBCOMBS  PTJLL  MODBL 
APPROPRIATB  mmSBR  OP  MIODLB  NODBS  7 

IMVBSTXGATINO  RBDUCBD  ARCHITBCTURB  NBXT 

///////////////////////////////////////// 

////////////////////////////////////////// 

MODBL  SBLBCTION  XTBRATION  1 

CURRBNT  NUMBBR  MIDDLB  NODBS/PBATURBS  -  PULL  MODBL  7  4 
CUIUIBNT  PBATURB  SBLBCTION  VBCTOR 
1111 

BBST  PULL  MODBLt  TOTAL  SSB=  7.1737«30«74a71 

SUMMARY  OP  10  RUNS 


1  #«P<S 

•50  RBDU  KBRRs 

4.06aai  %  TOTAL  SSBs 

11.4377 

a  #«p«= 

•00  RBDU  XBRRs 

3.S6Sra  %  TOTAL  SSBs 

8.30053 

a  #cpis 

660  RBDU  RBRRs 

4.6aoaa  %  total  ssBs 

11.7895 

4  #cpi= 

TOO  RBDU  XBRRs 

4.ar040  %  TOTAL  SSBs 

9.07790 

0  #«pis 

760  RBDU  KBRRs 

a.aoass  X  total  ssBs 

7.99443 

•  #«p»= 

660  RBDU  KBRRs 

3.01460  %  TOTAL  SSBs 

8.47713 

T  #.p.s 

660  RBDU  XBRRs 

3.aoaS6  %  TOTAL  SSBs 

8.43114 

•  #.P*= 

1000  RBDU  RBRRs 

a.40110  %  TOTAL  SSBs 

4.34473 

0  #«pi= 

660  RBDU  KBRRs 

4.aro44  X  TOTAL  SSBs 

9.13348 

10  #epis 

660  RBDU  KBRRs 

a.S440a  X  TOTAL  SSBs 

7.01343 

Carreat  featare  eelect  vector  1 

111 

Degrees  of  freedom  0  33# 

PaU  model  mlaimam  TOTAL  SSBs  r.l737a30674S71 
Redaced  model  mlaimam  TOTAL  SSB=  4.3447347487145 
LIKBLIHOOD  RATIO  TBST  STATISTIC  Ls  •15.533333419718 
Accept  Redaced  Modeli  SSB'R  ;  SSB*?  L  ;  P' ALPHA 
RBDUCBD  MODBL  BBCOMBS  PULL  MODBL 
APPROPRIATB  NUMBBR  OP  MIDDLB  NODBS  4 

INVBSTIGATING  RBDUCBD  ARCHITBCTURB  NBXT 

///////////////////////////////////////// 

////////////////////////////////////////// 

MODBL  SBLBCTION  ITBHATION  I 

CURRBNT  NUMBBR  MIDDLB  NODBS/PBATURBS  •  PULL  MODBL  6  4 
CimRBNT  PBATURB  SBLBCTION  VBCTOR 
lilt 
BIST  FULL  MODBLt  TOTAL  SSBs  4,UmiU$7M

SUlOf  ARY  OF  10  RUNS 


1  #«p.s 

6S0  URDU  KBRRs. 

3.49110 

K  TOTAL  9SR= 

7.41936 

a  #.p.= 

*80  RRDU  XRRRs 

1.77936 

K  TOTAL  SaRs 

6.70470 

s  #«p*= 

500  RRDU  11RRR= 

3.49110 

%  TOTAL  SSRs 

6.66141 

*  #«p*= 

400  RRDU  KRRR= 

3.91469 

%  TOTAL  SSRs 

6.66716 

5  #«p>= 

TOO  RRDU  KRRR= 

4.37046 

%  TOTAL  aSRs 

6.60496 

•  #«P»= 

550  RRDU  RRRR= 

3.64696 

%  TOTAL  aaR= 

6.63693 

T  #«P*= 

450  RRDU  KRRRs 

3.91469 

K  TOTAL  aaR= 

6.30973 

•  #«P»= 

550  RRDU  KRRR.. 

4.37046 

%  TOTAL  aaRs 

10.03347 

9  #.p.= 

1400  RRDU  NRRRs 

3.13595  %  TOTaL  SSRs 

6.49657 

10  #cpi= 

:  450  URDU  KRRR= 

4.3T044  %  TOTAL  SSR= 

11.6339 

Cvtteal  fe*lmr«  s«l«ct  vector  1111 
Degreee  of  freedom  0  344 

Fall  model  mUimam  TOTAL  SSBs  4.3«4T34r4«ri06 
Redaced  model  mlalmam  TOTAL  SSB:b  5.T047030045141 
LIKBLIHOOD  RATIO  TBST  STATISTIC  L=  13.4t4«1531034t 
Alpkal  s:  5.0000000000000D-03 
RBJBCT  RBOUCBD  MODBL 

APPROPRIATB  NUACBBR  OP  MIDDLB  NODBS  0 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiitiiiiii 

iiiHiiuii/ii/iiiiuiiiiiiiiiiiiiiiiiim 

PINAL  NUMBBR  MIDDLB  NODBS/PBATURBS  6  4 
STOP=  I 

PROGRAM  COMFLBTB 


B.2  Application  of  the  Feature  Selection  Algorithm  on  the  FLIR  Problem 

OVERALL NEURAL NETWORK SELECTION: AUDIT TRAIL


Number of ANN iterations to find good local min
in architecture and feature modules: 5

Number of ANN iterations for saliency screening module statistics: 10
Initial model structure is:

NUBdBBR  OF  INITIAL  kODDLB  NODBS  0 
NUBfBBR  OP  INITIAL  FBATURBS  9 
NUMBBR  OP  OUTPUTS  1 
1  OUTPUTS  USBD  FOR  3  CLASSBS 
NUMBBR  OP  VBCTORSt  TRAIN  •  TBSTi  350  300 
NUMBBR  OF  INITIAL  TRAINING  BP  OCHS  600 

NUMBBR  OF  ADDITIONAL  TRAINING  BPOCHS  BBTWBBN  TBST  SBT  SSB  BVALUATIONS  (m)  60 
IMPROVBMBNT  RBQUIRBD  BBTWBBN  TBST  SBT  BVALUATIONS  FOR  TRAINING  TO  CONTINUB 
(1-C3)9(  =  &.0OOOOB-O39( 

TRAINING  CONTINUBS  IFitf»e[t]  j  TS-SSB[t.m]*C3 
LOG  declialag  learaiag  rales  ased 
MOMBNTUM  STBP  SIKB  C3  =  0.300000 

MAXIMUM  NUMBBR  OP  FINAL  PBATiniBS  TO  BB  SBLBCTBDi  0 

STATISTICAL  SIGNIF  LBVBLSt  ARCB,FBAT,SALs  6.0000000000000D-03  6.0000000000000D>03 

6.0000000000000D-03 
RANDOM  NUMBBR  SBBD  19.0000 
////////////////////////////////////////////

/////////////////////////////////////// 


'INITIAL  ARCHITBCTURB  SBLBCTION  llODULBt' 


MODBL  SBLBCTION  ITBRATION  1 

CURRBNT  NUMBBR  MIDDLB  NODBS/PBATURBS  -  FULL  MODBL  •  • 
CURRBNT  PBATURB  SBLBCTION  VBCTOR 
111111111 


SUMMARY  OP  S  RimS 


1  #ep«o:  660  PULL  KBRR= 

3.09091  %  TOTAL  SSBs 

16.9041 

3  #.ps=  660  PULL  KBRR= 

4.64645  %  TOTAL  SSBa 

19Jt941 

3  #.p.=  600  PULL  KBRRs 

6.37373  %  TOTAL  SSBs 

97.1461 

4  #.p.3:  660  PULL  11BRR= 

3.61616  %  TOTAL  SSBa 

14.9434 

6  #.p.=  600  PULL  KBRR= 

4.00000  %  TOTAL  SSBs 

19.6461 

SUMMARY  OP  8  RUNS 

1  #.pis  600  RBDU  )iBRR= 

4.16163  %  TOTAL  SSB= 

30.1310 

3  #.pi=  660  RBDU  KBRRs 

3.61616  %  TOTAL  SSBs 

17.0346 

3  #.p.s  660  RBDU  WBRR= 

4.00000  %  TOTAL  SSBs 

31.1646 

4  #.p.=  600  RBDU  KBRRs 

3.37373  %  TOTAL  SSB= 

14.3370 

6  #.p.=  660  RBDU  KBRR= 

3.09091  %  TOTAL  SSBs 

17.3434 

C«rr«mt  telecl  vector  111111111 

Deftcee  of  free4oai  11  fSl 

PeU  model  mUlmem  TOTAL  SSBs  lB.90e06Mr3«6r 
Redaced  model  mUlmam  TOTAL  SSBs  14.334994303333 
LQCBLIHOOD  RATIO  TBST  STATISTIC  Ls  1.1090486443404 
Alpkals  6.0000000000000D-03 
Accept  Redaced  Model)  L  {  P'alpha 
RBDUCBD  MODBL  BBCOldBS  PULL  MODBL 
APPROPRIATB  NUMBBR  OP  MIDDLB  NODBS  7 


INVBSTIGATING  RBDUCBD  ARCRITBCTURB  NBXT 


iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 

////////////////////////////////////////// 


MODBL  SBLBCTION  ITBRATION  1 

CURRBNT  NUMBBR  MIDDLB  NODBS/PBATURBS  -  FULL  MODBL  T  9 
CURRBNT  PBATURB  SBLBCTION  VBCTOR 
111111111 

BBST  PULL  MODBL:  TOTAL  SSB=  16.339994303223 

SUMMARY  OP  6  RUNS 


1  #cp.=  600  RBDU  XBRRs 
3  #epi=  650  RBDU  KBRR= 

3  #.pi=  660  RBDU  KBRR= 

4  ^^mpt=  600  RBDU  NBRRs 
6  #.pa=  600  RBDU  KBRR:. 

Cmii.ll  (..lit.  Ml.el  T.c«or  1 
Dift...  of  fcdom  11  473 


3.46466  N  TOTAL  SSBs  16.1393 

4.00000  K  TOTAL  SSBs  17.6363 

4.90909  K  TOTAL  SSB=  33.6369 

3.61616  K  TOTAL  SSB=  16.6636 

3.61616  N  TOTAL  SSBs  16.4436 

11111111 
F«U  aoM  aUUKaa  TOTAL  SSIa  1«  JMtMMmS
K*4ac«4  modal  mUiamm  TOTAL  SSBa  ld.tStl*lt3«Mt 
LOOLmoOD  RATIO  TRST  STATISTIC  La  -O.dSSMdMdSSSlM 
Accofl  Radocod  Madali  SSRH  |  SSR?  L  |  P‘ ALPHA 
RBOOCRO  MODBL  BRCOldBS  PULL  UOORL 
APPROPRIATR  NUMBBR  OP  IfIDDLa  NODRS  S 

mvasTiOATiNa  rboucbo  arcritbcturb  nbxt 

///////////////////////////////////////// 

////////////////////////////////////////// 

IIODBL  SBLBCTION  ITBRATION  1 

CURRBNT  NUlfBBR  IdIDDLB  NODBS/PBATURBS  •  PULL  MODBL  d  • 
CURRBNT  PBATURB  SBLBCTION  VBCTOR 
111111111 

BBST  PULL  MODBLi  TOTAL  SSBa  Id.lSSldlBSdSSS 

SUMMARY  OP  S  RUNS 


1  #apla 

BOO  RBDU  KBRRa 

4.B4B4B  N  TOTAL  SSBa 

a0aai4i 

s  #ap»= 

BOO  RBDU  KBRRa 

4.B0B0B  N  TOTAL  SSBa 

aa.arift 

S  #•»•= 

BBO  RBDU  NBRRa 

4.3B3B4  N  TOTAL  SSBa 

19,7743 

4  #ap>a 

BOO  RBDU  NBRRa 

4.30304  N  TOTAL  SSBa 

19,93^4 

B  #apia 

BBO  RBDU  NBRRa 

4.00900  N  TOTAL  SSBa 

71.9994 

CvifiBi  select  1 

11111111 

Oaftaaa  at  ftaadoa  11  4SS 

PaU  aadat  aialaaa  TOTAL  SSBa  IS.lSSlSlSaSSBS 
Radacad  aadal  aialaaa  TOTAL  SSBa  IS.raMSTSSBlSt 
LDCBLIHOOD  RATIO  TBST  STATISTIC  La  S.TSSSIOSIISISS 
AIpkata  S.OOOOOOOOOOOOOD-OS 
RBJICT  RBDUCaO  MODBL 


APPROFRIATB  NUMBBR  OP  SODDLB  NODBS  S 

iiiiiiiiiiiiiiiiiiiiiiiiiniiiiiiiiiiiiii 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiiii 


RBDUCBD  MODBL  MIDDLB  NODBS/PBATURBS  S  S 
CANDIDATB  PBATURB  SBLBCTION  VBCTORi 
011111111 


SUMMARY  OP  6  RUNS 


1  #apta  BBO  RBDU  KBRRa 
a  #apaa  BOO  RBDU  NBRRa 

3  #apia  BBO  RBDU  KBRRa 

4  #epia  BOO  RBDU  KBRRs 
6  #apaa  BBO  RBDU  KBRRa 


B.3T3T3  K  TOTAL  SSBa  34.BrSB 

B.3raT3  K  TOTAL  SSBa  3B.13BT 

B.S1S1B  K  TOTAL  SSBa  aB.aTBT 

S.00000  N  TOTAL  SSBa  34.1030 

B.tOBOO  H  TOTAL  SSBa  30.0030 


RBDUCBD  MODBL  MIDDLB  NODBS/PBATURBS  0  B 
CANDIDATB  PBATURB  SBLBCTION  VBCTORi 
101111111 


SUMMARY  OP  B  RUNS 
t  #•»»  MO  RBOD  NaRR>i
»  SSO  UDU  KBBB» 

S  #€pta  OM  RBDU  MRRa 
4  #«pM  MO  RBDU  MaR> 
0  #«p»  000  URDU  KBRRa 


0.00001  %  TOTAL  SSBa 

o.anro  N  totai.  ssr> 

O.MMO  %  TOTAL  SSBa 

a.anTs  %  total  S8B« 
o.toioa  %  TOTAL  SSBo 


RBDUCBD  MODBL  UmOLB  NODBa/FBATURBS  0  0 
CANOIDATB  PBATURB  SBLBCTION  VBCTORi 
llOlltItl 


SUlOf  ARY  OP  6  Rims 


1  #«pts  050  RBDU  RBRRs 
a  #*pt3  000  RBDU  KBRRs 
a  #«p»  000  RBDU  KBRR> 
4  #«p>3  000  RBDU  RBRR> 
0  #«pas  000  RBDU  KBRRs 


4.0OM0  K  TOTAL  SSBs 

o.araTa  %  total  ssbs 

0.04040  %  TOTAL  SSBs 
0.40400  K  TOTAL  SSBs 
3.01010  %  TOTAL  SSB= 


RBDUCBD  MODBL  MIDDLB  MODBS/PBATURBS  4  0 
CAMDIDATB  PBATURB  SBLBCTION  VBCTORi 
111011111 

SUMMARY  OP  0  RUNS 


1  #<pi=:  500  RBDU  RBRR= 
a  #«pisi  000  RBDU  KBRRs 
a  #«pt=  000  RBDU  RBRR= 
4  #<p«s  000  RBDU  MBRRs 
0  #«pts  MO  RBDU  RBRR= 


T.OOOOl  %  TOTAL  SSB= 

4.nrar  %  total  8sb= 
a.oaoao  %  total  ssbs 
s.rarar  %  total  ssbs 
o.rarar  %  total  ssb^ 


RBDUCBD  MODBL  MIDDLB  NODBS/PBATURBS  0  0 
CANDIDATB  PBATURB  SBLBCTION  VBCTORi 
111101111 


SUMMARY  OP  0  RUNS 


1  #«ps>i  5M  RBDU  RBRR>: 
a  #«pas  550  RBDU  KBRRia 

3  #«p>=  000  RBDU  KBRRs 

4  #<p>=  000  RBDU  RBRRs 
0  #«pt=  TOO  RBDU  KBRR= 


r.40400  N  TOTAL  SSBs 
0.03M0  %  TOTAL  SSB= 
7.03030  %  TOTAL  SSBs 
0.34344  %  TOTAL  SSBs 
O.lOlOa  %  TOTAL  SSBs 


RBDUCBD  MODBL  MIDDLB  NODBS/PBATURBS  0  0 
CANDIDATB  PBATURB  SBLBCTION  VBCTORi 
111110111 


SUMMARY  OP  0  RUNS 


1  #«p>=  000  RBDU  XBRRs 
a  #«pa=  OM  RBDU  KBRRs 

3  #«pK  OM  RBDU  XBRRn 

4  #cptai  000  RBDU  NBRRi: 
0  #«p*=  TOO  RBDU  KBRRs 


0.00040  %  TOTAL  SSBn 
4.00000  %  TOTAL  SSB> 
4.TaTaT  %  TOTAL  SSBa 
4.00000  X  TOTAL  8SB>i 
4.1SlSa  X  TOTAL  SSBai 


RBDUCBD  MODBL  MIDDLB  NODBS/PBATURBS  0  0 
CANDIDATB  PBATURB  SBLBCTION  VBCTORi 
111111011 


aa.iaM 

aaA4T3 

34.3434 

30.1004 

33.0004 


33.3000 

33.3M4 

3S.4M0 

30A31T 

10.40M 


34.M30 

33.1033 

14.0440 

30.3100 

31.3M1 


33.0307 
34  .MOT 
30.30M 
30.3370 
3TA303 


30.1043 

31.4331 

30.4073 

3141170 

30.aiM 
SUMMARY  OP  t  RUNS


t  #«pa>  SM  RBDU  KHRR> 
3  #«p»i  tSO  RRDU  NRRRn 

3  #«pis  6M  RRDU  MRR> 

4  #tpKi  550  RRDU  RRRRs 

5  ptpia  550  RRDU  NRRRs 


5.00000  R  TOTAL  SSRa  30.5045 

4.50000  %  TOTAL  SSR>  33.1040 

5.00001  %  TOTAL  SSR>  34.0333 

4.15153  %  TOTAL  SSR_  30.T350 

5.54545  %  TOTAL  SSR>  30.1555 


RRDUCRD  MODRL  51IDOLR  NODRS/PRATURRS  5  5 
CANDmATR  PRATURR  SRLRCTION  VRCTORi 
111111101 


SUMMARY  OP  5  RUNS 


I  #<p>s  550  RRDU  RRRRs 
3  #«pt=  550  RRDU  KRRRs 

3  #«p>=  500  RRDU  RRRRs 

4  #*pts  550  RRDU  KRRRs 

5  #€pt3  550  RRDU  KRRRs 


3.00001  R  TOTAL  SSRc  15.5537 

4.00000  R  TOTAL  SSRs  10.5350 

5.35354  R  TOTAL  SSR=  31.3105 

5.37373  R  TOTAL  SSR=  31.7755 

3.45455  R  TOTAL  SSRs  15.5155 


RRDUCRD  MODRL  MmDLR  NODRS/PRATURRS  5  5 
CANDmATR  PRATURR  SRLRCTION  VRCTORt 
111111110 

SUMMARY  OP  5  RUNS 


1  #«p«= 

550  RRDU  NRRRs 

4.15153  %  TOTAL  SSR= 

3  #<pis 

550  RRDU  KRRRs 

4.73737  %  TOTAL  SSRs 

33.3330 

3  #«pts 

550  RRDU  KRRRs 

5.54545  %  TOTAL  SSRs 

31.4004 

4  #«pis 

750  RRDU  KRRRs 

3.45455  %  TOTAL  SSRs 

ir.03»3 

5  #*p4= 

700  RRDU  KRRRs 

5.35354  R  TOTAL  SSRs 

37.3071 

PRATURR 

1 

D«Ki4«t  of  fiooOoai  5  453 

PsU  aodol  aiUiBiaBi  SSRs  15.130151535350 

Rodacod  Bodol  Blaiiiaai  SSR=  10.553500507535 

LDCRLIHOOD  RATIO  TRST  STATISTIC  Ls  3.0575575703377 

Alpka3  dcsiiodss  5.0000000000000D-03 

ACCRPT  RRDUCRD  MODRLi  PNVAR  |  CNVAR  aad  L  |  P'olpka 


iimiiiiiiiiiniiiimniiiitniiiiiiiii 

iiiiiimiiiiiiiiiiiiiiiiiiiiiiiiiimiiii 


STOPS  0 


•■••••••••••STRUCTURR  SUBMODULR  NRXTi  •••••••••••••••••• 

MODRL  SRLRCTION  ITRRATION  3 

CURRRNT  NUMBRR  SdmDLR  NODRS/PRATURRS  ■  PULL  MODRL 
CURRRNT  PRATURR  SRLRCTION  VRCTOR 
111111101 

BRST  PULL  MODRLi  TOTAL  SSRs  15.553550557530 

SUMMARY  OP  5  RUNS 

I  #apts  550  RRDU  KRRR-  4.00000  %  TOTAL  SSR«  17.4303 

3  #ap»  500  RRDU  RRRRm  3.54545  K  TOTAL  SSRs  14A535 

3  #«p4a  500  RRDU  KRRRs  3.45455  K  TOTAL  SSR-  15.3730 


■ITRRATIONa  3 


5  0 
4  #•»»  tM  aaou  %KUim  T.anT3  %  total  isb-  m.mm

5  #•»»  400  BBOU  NBBB>  •.•tUt  K  TOTAL  ISBa  M.TMS 

Cams!  faalBia  HlacI  aaciat  lllllllOl 
Dafiaaa  at  ftaaAaa  10  400 

VaU  atadal  aUiaaM  TOTAL  SSB>  10.10300000 Toao 
BaOacaO  aiaOal  BUiBaai  TOTAL  SSBa  14.M313000000T 
LOCBLmOOO  BATIO  TBST  STATISTIC  L>  •1.0114130300030 
Accapi  Btdacad  llaOali  SSBH  t  SSB'P  L  )  F* ALPHA 
BBOUCBD  llOOBL  BBCOIOIS  PULL  MOOBL 
APPBOPBIATB  MUSIBBB  OP  llIDDLa  MODBS  1 

///////////////////////////////////////// 

iiiiiiiiiiiiiiiiiiiiiiiiiiiiiimiiiimii 


■PBATURB  SUBMOOULB  NBXTi 


'•lTBBATION>  3 


RBDUCBD  MODBL  IfIDDLB  NODBS/PBATUBBS  1  T 
CANOIOATa  PBATURB  SBLBCTION  VBCTORi 
011111101 


Sinof  ary  op  1  RUNS 


1  #apas  110  RBOU  KBRR= 
a  #apt=  000  RBOU  HBBRa 

3  #apt=  MO  BBDU  11BBR= 

4  #apts  000  RBDU  HBRRa 
1  #apt=  MO  RBDU  KBRRa: 


r.3TaT3  %  TOTAL  SSBs 
0.10103  %  TOTAL  SSB= 
0.00000  %  TOTAL  SSBs 
0.30304  H  TOTAL  SSBa 
0.00000  %  TOTAL  SSBai 


RBDUCBD  ItODBL  IHDDLB  NODBS/PBATURBS  1  T 
CANDIOATB  PBATURB  SBLBCTION  VBCTORi 
101111101 


SUIOIARY  OP  1  RUNS 


1  #ap>=  000  RBDU  KBRR= 
3  #apas  010  RBDU  HBRRsi 

3  #apiis  TOO  RBDU  KBRRa: 

4  #apt=  MO  RBDU  KBRR= 
1  #apa3  110  RBDU  KBRRs 


0.00000  N  TOTAL  SSBs 

o.rarar  K  total  ssb= 

4.00000  %  TOTAL  SSBs 
1.01010  %  TOTAL  SSBs 
T.OOOOl  K  TOTAL  SSBai 


REDUCED  MODEL  MIDDLE  NODES/FEATURES  5  7
CANDIDATE  FEATURE  SELECTION  VECTOR:
110111101


SUMMARY  OF  5  RUNS


1  #ep.=  MO  REDU  %ERR=  1.11103  %  TOTAL  SSE=
2  #ep.=  000  REDU  %ERR=  4.14041  %  TOTAL  SSE=
3  #ep.=  MO  REDU  %ERR=  o.raTsr  %  TOTAL  SSE=
4  #ep.=  000  REDU  %ERR=  4.14141  %  TOTAL  SSE=
5  #ep.=  IM  REDU  %ERR=  1.41411  %  TOTAL  SSE=


REDUCED  MODEL  MIDDLE  NODES/FEATURES  5  7
CANDIDATE  FEATURE  SELECTION  VECTOR:
111011101


30.00M 

aoAori 

31.3M4 

33.1343 

34B414 


37.3033 

30.0013 

17.3077 

33.4000 

34.0011 


30.3073 

30.3300 

30.00M 

33.3000 

33.0300 
SUMMARY  OF  5  RUNS


1  #ep.=  M«  REDU  %ERR=  i.04Ma  %  TOTAL  SSE=  ar.aooo
2  #ep.=  WM  REDU  %ERR=  4.04M»  %  TOTAL  SSE=  aOJtOOO
3  #ep.=  TM  REDU  %ERR=  t.toooo  %  TOTAL  SSE=  H-OTM
4  #ep.=  M«  REDU  %ERR=  •Joao4  %  TOTAL  SSE=  M.iaao
5  #ep.=  MO  REDU  %ERR=  a.4A4M  %  TOTAL  SSE=  IT.OrOt


REDUCED  MODEL  MIDDLE  NODES/FEATURES  5  7
CANDIDATE  FEATURE  SELECTION  VECTOR:
111101101


SUMMARY  OF  5  RUNS


1  #ep.=  OM  REDU  %ERR=  T.arara  %  TOTAL  SSE=  aa.aaoo
2  #ep.=  OM  REDU  %ERR=  t.T3T3T  %  TOTAL  SSE=  aT.OOOT
3  #ep.=  000  REDU  %ERR=  5.00000  %  TOTAL  SSE=  ao.orat
4  #ep.=  000  REDU  %ERR=  0.50000  %  TOTAL  SSE=  ao.ooao
5  #ep.=  500  REDU  %ERR=  5.aoao4  %  TOTAL  SSE=  aa.ooao


REDUCED  MODEL  MIDDLE  NODES/FEATURES  5  7
CANDIDATE  FEATURE  SELECTION  VECTOR:
111110101


SUMMARY  OF  5  RUNS


1  #ep.=  050  REDU  %ERR=  a.oaoa4  %  TOTAL  SSE=  ai.aioo
2  #ep.=  540  REDU  %ERR=  5.araTa  %  TOTAL  SSE=  ao.Moo
3  #ep.=  TOO  REDU  %ERR=  5.00051  %  TOTAL  SSE=  aO.OOM
4  #ep.=  MO  REDU  %ERR=  T.OMM  %  TOTAL  SSE=  35.4roi
5  #ep.=  550  REDU  %ERR=  5.0MaO  %  TOTAL  SSE=  33.7535


REDUCED  MODEL  MIDDLE  NODES/FEATURES  5  7
CANDIDATE  FEATURE  SELECTION  VECTOR:
111111001


SUMMARY  OF  5  RUNS


1  #ep.=  550  REDU  %ERR=  4.54545  %  TOTAL  SSE=  33.7700
2  #ep.=  050  REDU  %ERR=  4.00000  %  TOTAL  SSE=  31.0503
3  #ep.=  550  REDU  %ERR=  0.15103  %  TOTAL  SSE=  30.5tM
4  #ep.=  550  REDU  %ERR=  7.03030  %  TOTAL  SSE=  35.5400
5  #ep.=  050  REDU  %ERR=  5.00091  %  TOTAL  SSE=  34.3773


REDUCED  MODEL  MIDDLE  NODES/FEATURES  5  7
CANDIDATE  FEATURE  SELECTION  VECTOR:
111111100


SUMMARY  OF  5  RUNS


1  #ep.=  050  REDU  %ERR=  434545  %  TOTAL  SSE=
2  #ep.=  000  REDU  %ERR=  4.73737  %  TOTAL  SSE=  tt.MM
3  #ep.=  500  REDU  %ERR=  0.10153  %  TOTAL  SSE=  33.r41«
4  #ep.=  750  REDU  %ERR=  a.0a0M  %  TOTAL  SSE=  tr.M09
5  #ep.=  550  REDU  %ERR=  5.09030  %  TOTAL  SSE=  33.3310

FEATURE  selection  iteration  #  3
Degrees  of  freedom  •  4M
Full  model  minimum  SSE=  14.Wa»aMMMT
Reduced  model  minimum  SSE=  ir.erMl«»MtU
LIKELIHOOD  RATIO  TEST  STATISTIC  L=  ir.3T0M44aMl«
Alpha  desired=  ».0e4M44«M000D-ea
ACCEPT  REDUCED  MODEL:  FNVAR  t  CNVAR

///////////////////////////////////////// 

////////////////////////////////////////// 

STOP=  0


************STRUCTURE  SUBMODULE  NEXT:  *******************ITERATION=  1

MODEL  SELECTION  ITERATION  3

CURRENT  NUMBER  MIDDLE  NODES/FEATURES  -  FULL  MODEL  5  7
CURRENT  FEATURE  SELECTION  VECTOR
111011101

BEST  FULL  MODEL:  TOTAL  SSE=  11.070014030103
SUMMARY  OF  5  RUNS


1  #ep.=  SM  REDU  %ERR=  7.00001  %  TOTAL  SSE=  31.0703
2  #ep.=  000  REDU  %ERR=  0.37373  %  TOTAL  SSE=  30.4100
3  #ep.=  TOO  REDU  %ERR=  4.04040  %  TOTAL  SSE=  10.3070
4  #ep.=  1300  REDU  %ERR=  3.10103  %  TOTAL  SSE=  13.0000
5  #ep.=  400  REDU  %ERR=  3.01010  %  TOTAL  SSE=  17.4043


Current  feature  select  vector  111011101
Degrees  of  freedom  0  004

Full  model  minimum  TOTAL  SSE=  17.070010030103
Reduced  model  minimum  TOTAL  SSE=  13.004030470003
LIKELIHOOD  RATIO  TEST  STATISTIC  L=  .10J00047104131
Accept  Reduced  Model:  SSE'R  <  SSE'F  L  <  F* ALPHA
REDUCED  MODEL  BECOMES  FULL  MODEL
APPROPRIATE  NUMBER  OF  MIDDLE  NODES  4


/////////////////////////////////////////

////////////////////////////////////////// 


FEATURE  SUBMODULE  NEXT:


***ITERATION=  3


REDUCED  MODEL  MIDDLE  NODES/FEATURES  4  6
CANDIDATE  FEATURE  SELECTION  VECTOR:
011011101

SUMMARY  OF  5  RUNS


1  #ep.=  000  REDU  %ERR=  0.00000  %  TOTAL  SSE=  33.0401
2  #ep.=  000  REDU  %ERR=  7.00001  %  TOTAL  SSE=  33.3001
3  #ep.=  000  REDU  %ERR=  0.73737  %  TOTAL  SSE=  30.3371
4  #ep.=  000  REDU  %ERR=  0.40400  %  TOTAL  SSE=  43.0003
5  #ep.=  000  REDU  %ERR=  0.00000  %  TOTAL  SSE=  31.0730


REDUCED  MODEL  MIDDLE  NODES/FEATURES  4  6
CANDIDATE  FEATURE  SELECTION  VECTOR:
101011101


SUMMARY  OF  5  RUNS


1  #ep.=  SM  REDU  %ERR=  S.OeSM  %  TOTAL  SSE=
2  #ep.=  SM  REDU  %ERR=  S.0SM1  %  TOTAL  SSE=  a3.43M
3  #ep.=  SSO  REDU  %ERR=  s.anr3  %  TOTAL  SSE=  31.MU
4  #ep.=  SM  REDU  %ERR=  lo.nrs  %  TOTAL  SSE=  41.0331
5  #ep.=  SM  REDU  %ERR=  S.S0SM  %  TOTAL  SSE=  43.OT0f

REDUCED  MODEL  MIDDLE  NODES/FEATURES  4  6
CANDIDATE  FEATURE  SELECTION  VECTOR:
110011101

SUMMARY  OF  5  RUNS


1  #ep.=  5M  REDU  %ERR=  S.S4S4S  %  TOTAL  SSE=  SS.OSSO
2  #ep.=  SM  REDU  %ERR=  s.nrsr  %  TOTAL  SSE=  34.s4ss
3  #ep.=  SM  REDU  %ERR=  s.istsa  %  TOTAL  SSE=  3S.03S0
4  #ep.=  SM  REDU  %ERR=  S.SOSOS  %  TOTAL  SSE=  30.SMS
5  #ep.=  SM  REDU  %ERR=  4.nnr  %  TOTAL  SSE=  ao.soss


REDUCED  MODEL  MIDDLE  NODES/FEATURES  4  6
CANDIDATE  FEATURE  SELECTION  VECTOR:
111001101


SUMMARY  OF  5  RUNS


1  #ep.=  TM  REDU  %ERR=  r.OSMl  %  TOTAL  SSE=  M.MS4
2  #ep.=  SM  REDU  %ERR=  S.S4S4S  %  TOTAL  SSE=  34ASST
3  #ep.=  SM  REDU  %ERR=  lO.OOOM  %  TOTAL  SSE=  ST.lOtl
4  #ep.=  SM  REDU  %ERR=  r.0SMl  %  TOTAL  SSE=  33.SSM
5  #ep.=  SM  REDU  %ERR=  S.S4S4S  %  TOTAL  SSE=  3S.MS1


REDUCED  MODEL  MIDDLE  NODES/FEATURES  4  6
CANDIDATE  FEATURE  SELECTION  VECTOR:
111010101


SUMMARY  OF  5  RUNS


1  #ep.=  SM  REDU  %ERR=  s.osMi  %  TOTAL  SSE=  as.ssss
2  #ep.=  SM  REDU  %ERR=  S.S1S1S  %  TOTAL  SSE=  ss.iota
3  #ep.=  SM  REDU  %ERR=  s.nrar  %  TOTAL  SSE=  ss.Mia
4  #ep.=  SM  REDU  %ERR=  S.OSMI  %  TOTAL  SSE=  aS.MTr
5  #ep.=  SM  REDU  %ERR=  s.isisa  %  TOTAL  SSE=  as.taM


REDUCED  MODEL  MIDDLE  NODES/FEATURES  4  6
CANDIDATE  FEATURE  SELECTION  VECTOR:
111011001


SUMMARY  OF  5  RUNS

1  #ep.=  SM  REDU  %ERR=  S.S3S3S  %  TOTAL  SSE=  33.3047
2  #ep.=  SM  REDU  %ERR=  S.lSlSa  %  TOTAL  SSE=  a4.S3M
3  #ep.=  SM  REDU  %ERR=  SAS3S4  %  TOTAL  SSE=  34.0001
4  #ep.=  SM  REDU  %ERR=  S.nrar  %  TOTAL  SSE=  so.tmi
5  #ep.=  SM  REDU  %ERR=  s.ann  %  TOTAL  SSE=  aa.sMi
REDUCED  MODEL  MIDDLE  NODES/FEATURES  4  6
CANDIDATE  FEATURE  SELECTION  VECTOR:
111011100


SUMMARY  OF  5  RUNS

1  #ep.=  SOO  REDU  %ERR=  4.10103  %  TOTAL  SSE=  ir.t440
2  #ep.=  600  REDU  %ERR=  6.3T3T3  %  TOTAL  SSE=  30.0040
3  #ep.=  660  REDU  %ERR=  r.tiait  %  TOTAL  SSE=  34.3010
4  #ep.=  too  REDU  %ERR=  4.64646  %  TOTAL  SSE=  lO.OOTT
5  #ep.=  660  REDU  %ERR=  6.46466  %  TOTAL  SSE=  36.r004


FEATURE  selection  iteration  #  3

Degrees  of  freedom  4  613
Full  model  minimum  SSE=  13.0000304T0003
Reduced  model  minimum  SSE=  17.044060430117
LIKELIHOOD  RATIO  TEST  STATISTIC  L=  00.000014300300
Alpha  desired=  5.0000000000000D-03
ACCEPT  REDUCED  MODEL:  FNVAR  )  CNVAR
FINAL  NUMBER  OF  MIDDLE  NODES/FEATURES:  4/  6
FINAL  FEATURE  SELECTION  VECTOR
111011100


///////////////////////////////////////// 

////////////////////////////////////////// 

STOP=  0


************STRUCTURE  SUBMODULE  NEXT:  ******************

MODEL  SELECTION  ITERATION  4

CURRENT  NUMBER  MIDDLE  NODES/FEATURES  -  FULL  MODEL
CURRENT  FEATURE  SELECTION  VECTOR
111011100

BEST  FULL  MODEL:  TOTAL  SSE=  17.044060430117

SUMMARY  OF  5  RUNS


1  #ep.=  too  REDU  %ERR=  4.00000  %  TOTAL  SSE=  aa.ioto
2  #ep.=  660  REDU  %ERR=  6.37373  %  TOTAL  SSE=  aa.sMT
3  #ep.=  too  REDU  %ERR=  4.00000  %  TOTAL  SSE=  lf.4«46
4  #ep.=  060  REDU  %ERR=  7.03030  %  TOTAL  SSE=  3l.t431
5  #ep.=  660  REDU  %ERR=  0.30304  %  TOTAL  SSE=  3r.33n

Current  select  vector  111011100

Degrees  of  freedom  0  617

Full  model  minimum  TOTAL  SSE=  17.044060430117
Reduced  model  minimum  TOTAL  SSE=  10.404667613130
LIKELIHOOD  RATIO  TEST  STATISTIC  L=  6.0740000774000
Alpha1=  5.0000000000000D-03
REJECT  REDUCED  MODEL

APPROPRIATE  NUMBER  OF  MIDDLE  NODES  4

///////////////////////////////////////// 

////////////////////////////////////////// 


***ITERATION=  4


4  0 
FINAL  NUMBER  MIDDLE  NODES/FEATURES:  4  6
FINAL  FEATURE  SELECTION  VECTOR
111011100 


///////////////////////////////////////// 
////////////////////////////////////////// 
PROGRAM  COMPLETE


B.S  Application  of  the  Comprehensive  Neural  Network  Selection  Methodology  on  a  Modified  XOR 
Problem 

OVERALL  NEURAL  NETWORK  SELECTION:  AUDIT  TRAIL


Number  of  ANN  iterations  to  find  good  local  min
in  architecture  and  feature  modules:  5

Number  of  ANN  iterations  for  saliency  screening  module  statistics  10
Initial  model  structure  is:

NUMBER  OF  INITIAL  MIDDLE  NODES  6

NUMBER  OF  INITIAL  FEATURES  5
NUMBER  OF  OUTPUTS  1
1  OUTPUTS  USED  FOR  2  CLASSES
NUMBER  OF  VECTORS:  TRAIN  ,  TEST:  300  300
NUMBER  OF  INITIAL  TRAINING  EPOCHS  500

NUMBER  OF  ADDITIONAL  TRAINING  EPOCHS  BETWEEN  TEST  SET  SSE  EVALUATIONS  (m)  50
IMPROVEMENT  REQUIRED  BETWEEN  EVALUATIONS  FOR  TRAINING  TO  CONTINUE  (1-C3)%
O.OOOOOB-OSK

TRAINING  CONTINUES  IF  TS-SSE[t]  <  TS-SSE[t-m]*C3
LOG  declining  learning  rates  used
MOMENTUM  STEP  SIZE  C3  =  0.300000

ONLY  THE  STATISTICALLY  INDICATED  FEATURES  ARE  REMOVED

STATISTICAL  SIGNIF  LEVELS:  ARCH,FEAT,SAL=  5.0000000000000D-03  5.0000000000000D-03
5.0000000000000D-03
RANDOM  NUMBER  SEED  14.0000

/////////////////////////////////////////

//////////////////////////////////////////

************INITIAL  ARCHITECTURE  SELECTION

MODEL  SELECTION  ITERATION  1

CURRENT  NUMBER  MIDDLE  NODES/FEATURES  -  FULL  MODEL  6  5
CURRENT  FEATURE  SELECTION  VECTOR
11111

SUMMARY  OF  5  RUNS


1  #ep.=  550  FULL  %ERR=  1.60000  %  TOTAL  SSE=  7.fllt4
2  #ep.=  550  FULL  %ERR=  3.20000  %  TOTAL  SSE=  lo.rrts
3  #ep.=  TOO  FULL  %ERR=  3.60000  %  TOTAL  SSE=  15.33M
4  #ep.=  600  FULL  %ERR=  3.60000  %  TOTAL  SSE=  10.33MO
5  #ep.=  550  FULL  %ERR=  2.20000  %  TOTAL  SSE=  10.075M

SUMMARY  OF  5  RUNS
1  #ep.=  MO  REDU  %ERR=  3.00000  %  TOTAL  SSE=  lO.MU
2  #ep.=  too  REDU  %ERR=  1.40000  %  TOTAL  SSE=  0.00073
3  #ep.=  000  REDU  %ERR=  1.00000  %  TOTAL  SSE=  10.44300
4  #ep.=  000  REDU  %ERR=  3.00000  %  TOTAL  SSE=  0.34700
5  #ep.=  OM  REDU  %ERR=  4.40000  %  TOTAL  SSE=  14.0007

Current  feature  vector  11111

Degrees  of  freedom  7  407

Full  model  minimum  TOTAL  SSE=  7.0110340043030
Reduced  model  minimum  TOTAL  SSE=  0.0007344047737
LIKELIHOOD  RATIO  TEST  STATISTIC  L=  1.4703134404004
Alpha1=  5.0000000000000D-03
Accept  Reduced  Model:  L  <  F* ALPHA
REDUCED  MODEL  BECOMES  FULL  MODEL
APPROPRIATE  NUMBER  OF  MIDDLE  NODES  5

INVESTIGATING  REDUCED  ARCHITECTURE  NEXT

/////////////////////////////////////////

////////////////////////////////////////// 

MODEL  SELECTION  ITERATION  1

CURRENT  NUMBER  MIDDLE  NODES/FEATURES  -  FULL  MODEL  5  5
CURRENT  FEATURE  SELECTION  VECTOR
11111

BEST  FULL  MODEL:  TOTAL  SSE=  •.OSOnMS.TTar

SUMMARY  OF  5  RUNS

1  #ep.=  <00  REDU  %ERR=  3.40000  %  TOTAL  SSE=  lO.OTOl
2  #ep.=  «0  REDU  %ERR=  1.00000  %  TOTAL  SSE=  9.<0403
3  #ep.=  «0  REDU  %ERR=  1.40000  %  TOTAL  SSE=  <.SM0T
4  #ep.=  <50  REDU  %ERR=  3.30000  %  TOTAL  SSE=  14.3301
5  #ep.=  TOO  REDU  %ERR=  1.40000  %  TOTAL  SSE=  <.3T500

Current  feature  select  vector  11111
Degrees  of  freedom  7  444

Full  model  minimum  TOTAL  SSE=  4.0907344547737
Reduced  model  minimum  TOTAL  SSE=  4.3340494444074
LIKELIHOOD  RATIO  TEST  STATISTIC  L=  3.0344133471304
Alpha1=  5.0000000000000D-03
REJECT  REDUCED  MODEL

APPROPRIATE  NUMBER  OF  MIDDLE  NODES  5

///////////////////////////////////////// 

////////////////////////////////////////// 


SALIENCY  SCREENING  MODULE:


SUMMARY  OF  10  TRAINING  RUNS

1  #ep.=  540  REDU  %ERR=  3.40000  %  TOTAL  SSE=  10.34703
2  #ep.=  700  REDU  %ERR=  1.40000  %  TOTAL  SSE=  7.44934
3  #ep.=  460  REDU  %ERR=  1.40000  %  TOTAL  SSE=  9.14494
4  #ep.=  too  REDU  %ERR=  3.30000  %  TOTAL  SSE=  6.133M
5  #ep.=  000  REDU  %ERR=  3.00000  %  TOTAL  SSE=  9.61334
6  #ep.=  700  REDU  %ERR=  1.00000  %  TOTAL  SSE=  9.96164
7  #ep.=  000  REDU  %ERR=  0.00000  %  TOTAL  SSE=  34.6666
8  #ep.=  700  REDU  %ERR=  9.00000  %  TOTAL  SSE=  14.7960
9  #ep.=  000  REDU  %ERR=  9.40000  %  TOTAL  SSE=  11.3666
10  #ep.=  000  REDU  %ERR=  3.40000  %  TOTAL  SSE=  9.06073

NUMBER  OF  NOISE-LIKE  FEATURES  REMOVED  WITH  SALIENCY  SCREENING  2
FEATURE  SELECTION  VECTOR  AFTER  SALIENCY  SCREENING
11001

///////////////////////////////////////// 

////////////////////////////////////////// 


************STRUCTURE  SUBMODULE  NEXT:  *******************ITERATION=  1

MODEL  SELECTION  ITERATION  1

CURRENT  NUMBER  MIDDLE  NODES/FEATURES  -  FULL  MODEL  5  3
CURRENT  FEATURE  SELECTION  VECTOR
11001

SUMMARY  OF  5  RUNS


1  #ep.=  700  FULL  %ERR=  1.50000  %  TOTAL  SSE=  7.64366
2  #ep.=  700  FULL  %ERR=  1.00000  %  TOTAL  SSE=  6.00643
3  #ep.=  500  FULL  %ERR=  1.50000  %  TOTAL  SSE=  6.06636
4  #ep.=  750  FULL  %ERR=  3.00000  %  TOTAL  SSE=  6.76769
5  #ep.=  550  FULL  %ERR=  0.400000  %  TOTAL  SSE=  7.63366

SUMMARY  OF  5  RUNS

1  #ep.=  500  REDU  %ERR=  17.0000  %  TOTAL  SSE=  40.9630
2  #ep.=  550  REDU  %ERR=  3.50000  %  TOTAL  SSE=  13.6463
3  #ep.=  500  REDU  %ERR=  1.40000  %  TOTAL  SSE=  7.66013
4  #ep.=  700  REDU  %ERR=  1.50000  %  TOTAL  SSE=  7.36633
5  #ep.=  500  REDU  %ERR=  1.00000  %  TOTAL  SSE=  7.66410

Current  select  vector  11001

Degrees  of  freedom  0  004

Full  model  minimum  TOTAL  SSE=  r.0430S3430r309
Reduced  model  minimum  TOTAL  SSE=  T.3063344003T10
LIKELIHOOD  RATIO  TEST  STATISTIC  L=  -3.1300333730406
Accept  Reduced  Model:  SSE'R  <  SSE'F
REDUCED  MODEL  BECOMES  FULL  MODEL
APPROPRIATE  NUMBER  OF  MIDDLE  NODES  4


INVESTIGATING  REDUCED  ARCHITECTURE


/////////////////////////////////////////

//////////////////////////////////////////

MODEL  SELECTION  ITERATION  1

CURRENT  NUMBER  MIDDLE  NODES/FEATURES  -  FULL  MODEL  4  3
CURRENT  FEATURE  SELECTION  VECTOR
11001

BEST  FULL  MODEL:  TOTAL  SSE=  1.3053944063710
SUMMARY  OF  5  RUNS


1  #ep.=  550  REDU  %ERR=  5.00000  %  TOTAL  SSE=  14.3314
2  #ep.=  950  REDU  %ERR=  14.4000  %  TOTAL  SSE=  43.4749
3  #ep.=  550  REDU  %ERR=  4.00000  %  TOTAL  SSE=  17.5734
4  #ep.=  500  REDU  %ERR=  14.3000  %  TOTAL  SSE=  40.4370
5  #ep.=  500  REDU  %ERR=  4.00000  %  TOTAL  SSE=  14.3334

Current  feature  select  vector  11001

Degrees  of  freedom  6  4S9

Full  model  minimum  TOTAL  SSE=  r.3a6a3440fta71«

Reduced  model  minimum  TOTAL  SSE=  IC.aai 757134445

LIKELIHOOD  RATIO  TEST  STATISTIC  L=  113.33343994444
Alpha1=  5.0000000000000D-03

REJECT  REDUCED  MODEL

APPROPRIATE  NUMBER  OF  MIDDLE  NODES  4

///////////////////////////////////////// 

////////////////////////////////////////// 


FEATURE  SUBMODULE  NEXT:


***ITERATION=  1


REDUCED  MODEL  MIDDLE  NODES/FEATURES  4  2
CANDIDATE  FEATURE  SELECTION  VECTOR:
01001

SUMMARY  OF  5  RUNS

1  #ep.=  1950  REDU  %ERR=  1.30000  %  TOTAL  SSE=  4.49473
2  #ep.=  400  REDU  %ERR=  1.40000  %  TOTAL  SSE=  4.41473
3  #ep.=  550  REDU  %ERR=  4.30000  %  TOTAL  SSE=  19.1154
4  #ep.=  750  REDU  %ERR=  3.50000  %  TOTAL  SSE=  14.3031
5  #ep.=  550  REDU  %ERR=  1.30000  %  TOTAL  SSE=  4.34344

REDUCED  MODEL  MIDDLE  NODES/FEATURES  4  2
CANDIDATE  FEATURE  SELECTION  VECTOR:
10001


SUMMARY  OF  5  RUNS

1  #ep.=  500  REDU  %ERR=  51.5000  %  TOTAL  SSE=  134.733
2  #ep.=  650  REDU  %ERR=  51.5000  %  TOTAL  SSE=  135.0TS
3  #ep.=  550  REDU  %ERR=  49.5000  %  TOTAL  SSE=  134.543
4  #ep.=  550  REDU  %ERR=  45.3000  %  TOTAL  SSE=  135.595
5  #ep.=  550  REDU  %ERR=  45.3000  %  TOTAL  SSE=  134.545


REDUCED  MODEL  MIDDLE  NODES/FEATURES  4  2
CANDIDATE  FEATURE  SELECTION  VECTOR:
11000


SUMMARY  OF  5  RUNS


1  #ep.=  750  REDU  %ERR=  3.30000  %  TOTAL  SSE=  14.5345
2  #ep.=  550  REDU  %ERR=  17.5000  %  TOTAL  SSE=  50.4545
3  #ep.=  1330  REDU  %ERR=  1.40000  %  TOTAL  SSE=  TA310T
4  #ep.=  000  REDU  %ERR=  3.00000  %  TOTAL  SSE=  ir.4030
5  #ep.=  1300  REDU  %ERR=  3.30000  %  TOTAL  SSE=  10.0T4B


FEATURE  selection  iteration  #  1

Degrees  of  freedom  4  400

Full  model  minimum  SSE=  1.3063344033710

Reduced  model  minimum  SSE=  3.403T3434333T1

LIKELIHOOD  RATIO  TEST  STATISTIC  L=  -14.1331003T4410

ACCEPT  REDUCED  MODEL:  SSE'R  <  SSE'F


///////////////////////////////////////// 

////////////////////////////////////////// 


STOP=  0


************STRUCTURE  SUBMODULE  NEXT:  *******************ITERATION

MODEL  SELECTION  ITERATION  2

CURRENT  NUMBER  MIDDLE  NODES/FEATURES  -  FULL  MODEL  4  2
CURRENT  FEATURE  SELECTION  VECTOR
01001

BEST  FULL  MODEL:  TOTAL  SSE=  «.405734a4«SSri

SUMMARY  OF  5  RUNS


1  #ep.=  400  REDU  %ERR=  14.3000  %  TOTAL  SSE=  60.7749
2  #ep.=  330  REDU  %ERR=  3.40000  %  TOTAL  SSE=  31.6657
3  #ep.=  400  REDU  %ERR=  3.40000  %  TOTAL  SSE=  16.4663
4  #ep.=  330  REDU  %ERR=  13.3000  %  TOTAL  SSE=  60.3443
5  #ep.=  400  REDU  %ERR=  3.40000  %  TOTAL  SSE=  16.4379

Current  vector  01001

Degrees  of  freedom  4  473

Full  model  minimum  TOTAL  SSE=  «.406r3434«0a7l
Reduced  model  minimum  TOTAL  SSE=  10.43TM0000077
LIKELIHOOD  RATIO  TEST  STATISTIC  L=  1IO.S0777930350
Alpha1=  5.0000000000000D-03
REJECT  REDUCED  MODEL


APPROPRIATE  NUMBER  OF  MIDDLE  NODES  4


///////////////////////////////////////// 

////////////////////////////////////////// 


FEATURE  SUBMODULE  NEXT:


***ITERATION=


REDUCED  MODEL  MIDDLE  NODES/FEATURES  4  1
CANDIDATE  FEATURE  SELECTION  VECTOR:
00001


SUMMARY  OF  5  RUNS

1  #ep.=  600  REDU  %ERR=  30.4000  %  TOTAL  SSE=  134.003
2  #ep.=  300  REDU  %ERR=  40.3000  %  TOTAL  SSE=  130.103
3  #ep.=  000  REDU  %ERR=  30.3000  %  TOTAL  SSE=  134.113
4  #ep.=  COO  REDU  %ERR=  49.4000  %  TOTAL  SSE=  104.000
5  #ep.=  000  REDU  %ERR=  40.3000  %  TOTAL  SSE=  laO.TOO

REDUCED  MODEL  MIDDLE  NODES/FEATURES  4  1
CANDIDATE  FEATURE  SELECTION  VECTOR:
01000


SUMMARY  OF  5  RUNS


1  #ep.=  000  REDU  %ERR=  4T.OOOO  %  TOTAL  SSE=  134.097
2  #ep.=  000  REDU  %ERR=  01.0000  %  TOTAL  SSE=  134.540
3  #ep.=  000  REDU  %ERR=  40.4000  %  TOTAL  SSE=  134.431
4  #ep.=  000  REDU  %ERR=  40.3000  %  TOTAL  SSE=  134.710
5  #ep.=  000  REDU  %ERR=  4T.0000  %  TOTAL  SSE=  130.109

FEATURE  selection  iteration  #  3

Degrees  of  freedom  4  4T3

Full  model  minimum  SSE=  e.400ra434050ri

Reduced  model  minimum  SSE=  134.43007400004

LIKELIHOOD  RATIO  TEST  STATISTIC  L=  3140.9103473471

Alpha  desired=  5.0000000000000D-03

REJECT  REDUCED  MODEL

FINAL  NUMBER  OF  MIDDLE  NODES/FEATURES:  4/  2
FINAL  FEATURE  SELECTION  VECTOR
01001


///////////////////////////////////////// 

////////////////////////////////////////// 

STOP=  1


SALIENCY  MODULE  NEXT:  3


SUMMARY  OF  10  TRAINING  RUNS


1  #ep.=  000  REDU  %ERR=  3.00000  %  TOTAL  SSE=  7.07730
2  #ep.=  000  REDU  %ERR=  1.40000  %  TOTAL  SSE=  0.00913
3  #ep.=  050  REDU  %ERR=  4.40000  %  TOTAL  SSE=  10.0303
4  #ep.=  050  REDU  %ERR=  1.00000  %  TOTAL  SSE=  10.0007
5  #ep.=  550  REDU  %ERR=  4.40000  %  TOTAL  SSE=  17.3009
6  #ep.=  900  REDU  %ERR=  1.40000  %  TOTAL  SSE=  7.73430
7  #ep.=  050  REDU  %ERR=  3.30000  %  TOTAL  SSE=  9.11519
8  #ep.=  600  REDU  %ERR=  3.00000  %  TOTAL  SSE=  9.40393
9  #ep.=  550  REDU  %ERR=  0.40000  %  TOTAL  SSE=  19.1411
10  #ep.=  «00  REDU  %ERR=  4.00000  %  TOTAL  SSE=  10.0100

FINAL  NUMBER  MIDDLE  NODES/FEATURES:  4  2
FINAL  FEATURE  SELECTION  VECTOR
01001

///////////////////////////////////////// 

//////////////////////////////////////////

PROGRAM  COMPLETE
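
The accept/reject messages recorded in the audit trails above can be illustrated with a short sketch. It is not the program that produced these listings. It assumes the test statistic has the usual extra-sum-of-squares form for nested regression models, L = [(SSE_R - SSE_F) / df1] / (SSE_F / df2), and that the two "Degrees of freedom" entries are df1 and df2. The function name, its arguments, the example SSE values, and the critical value passed in as `f_crit` are all hypothetical.

```python
def reduced_model_decision(sse_full, sse_reduced, df_num, df_den, f_crit):
    """Return (L, decision) for a full-versus-reduced model comparison.

    sse_full, sse_reduced: minimum total SSE of the full and reduced models.
    df_num, df_den: numerator and denominator degrees of freedom (assumed
        to correspond to the two "Degrees of freedom" entries in the log).
    f_crit: critical F value for the chosen significance level; supplied by
        the caller (hypothetical input, e.g. read from an F table).
    """
    if df_num <= 0 or df_den <= 0:
        raise ValueError("degrees of freedom must be positive")
    if sse_full <= 0:
        raise ValueError("full-model SSE must be positive")

    # Assumed extra-sum-of-squares form of the test statistic.
    stat = ((sse_reduced - sse_full) / df_num) / (sse_full / df_den)

    # Case 1: the reduced model fits at least as well (log: SSE'R < SSE'F).
    if sse_reduced <= sse_full:
        return stat, "accept reduced model"
    # Case 2: the loss of fit is not significant (log: L < F* ALPHA).
    if stat < f_crit:
        return stat, "accept reduced model"
    # Case 3: the loss of fit is significant (log: REJECT REDUCED MODEL).
    return stat, "reject reduced model"


# Hypothetical values chosen for easy arithmetic, not taken from the listings.
better_fit = reduced_model_decision(8.0, 6.0, 2, 4, f_crit=3.0)
small_loss = reduced_model_decision(8.0, 12.0, 2, 4, f_crit=3.0)
large_loss = reduced_model_decision(8.0, 24.0, 2, 4, f_crit=3.0)
```

The three calls correspond to the three message types in the listings: a negative statistic with the reduced model accepted outright, a small positive statistic accepted against the critical value, and a large statistic leading to rejection.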